Re: [Rdkit-discuss] HasSubstructMatch doesn't work as expected

2017-09-13 Thread Greg Landrum
This isn't a really straightforward one.

One solution (and I think the best one) is to change the aromaticity model
used to match whatever is generating the hits (in your case it's the Symyx
cartridge).
The RDKit has functionality to do this already when you call the
SetAromaticity() function:

In [29]: m2 = Chem.MolFromMolFile('./CHEMBL25.mol',sanitize=False)

In [30]: Chem.SanitizeMol(m2,Chem.SANITIZE_ALL^Chem.SANITIZE_SETAROMATICITY)
Out[30]: rdkit.Chem.rdmolops.SanitizeFlags.SANITIZE_NONE

In [31]: Chem.SetAromaticity(m2,Chem.AROMATICITY_SIMPLE)


The problem here is that there isn't an aromaticity model there for
MDL/Symyx. This would be a useful thing to have and can be done quickly. If
someone can describe the aromaticity model to me, or point me to a
description of it, I can add it for the next release (which happens soon).
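
To try the two-step recipe above without a molfile, here is a minimal sketch that does the same thing starting from a SMILES string. The helper name `aromatic_atom_count` and the example SMILES are assumptions made for illustration and are not part of the original session; only `SanitizeMol`, `SetAromaticity` and the `AromaticityModel` constants are RDKit calls.

```python
from rdkit import Chem


def aromatic_atom_count(smiles, model):
    """Count aromatic atoms after applying one aromaticity model.

    Hypothetical helper, for illustration only.
    """
    # Parse without sanitization so that no aromaticity is perceived yet.
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    # Run every sanitization step except aromaticity perception.
    Chem.SanitizeMol(mol, Chem.SANITIZE_ALL ^ Chem.SANITIZE_SETAROMATICITY)
    # Apply the requested aromaticity model explicitly.
    Chem.SetAromaticity(mol, model)
    return sum(1 for atom in mol.GetAtoms() if atom.GetIsAromatic())


# A Kekule-form benzopyranone, chosen only as an example input.
example = "O=C1OC=CC2=CC=CC=C12"
for name in ("AROMATICITY_RDKIT", "AROMATICITY_SIMPLE"):
    model = getattr(Chem.AromaticityModel, name)
    print(name, aromatic_atom_count(example, model))
```

Comparing the counts for the same input under different models is a quick way to see where two toolkits are likely to disagree about a substructure hit.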

Another solution that I think would work is to read the query molecule in
without doing aromaticity perception (see input line 30 above) and then to
convert all the bonds to either single-or-aromatic or double-or-aromatic
queries using the approaches described here:
http://rdkit.blogspot.ch/2015/08/tuning-substructure-queries.html
and here:
http://rdkit.blogspot.ch/2016/07/tuning-substructure-queries-ii.html

Unfortunately the AdjustQueryParameters options don't include anything that
helps with the kind of bond queries you need, so you'd need to make the
bond changes in your own code. If you want to go down this road and it's not
clear how to do so, let me know and I can post some sample code. I'm afraid
it's not completely trivial with bond queries.
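
A minimal sketch of the second approach, under stated assumptions: the query is kept in Kekule form, and every single bond is swapped for a single-or-aromatic query bond and every double bond for a double-or-aromatic one. The helper `relax_bond_aromaticity` and the acetoxy-isocoumarin target are hypothetical stand-ins (the real CHEMBL1999443 structure is not reproduced here), and this is not necessarily the sample code offered above.

```python
from rdkit import Chem

# Template query bonds, taken from two tiny SMARTS patterns.
SINGLE_OR_AROMATIC = Chem.MolFromSmarts("*-,:*").GetBondWithIdx(0)
DOUBLE_OR_AROMATIC = Chem.MolFromSmarts("*=,:*").GetBondWithIdx(0)


def relax_bond_aromaticity(query_smiles):
    """Return a query whose single/double bonds also match aromatic bonds.

    Hypothetical helper; not an RDKit function.
    """
    # Keep the Kekule form: sanitize, but skip aromaticity perception.
    mol = Chem.MolFromSmiles(query_smiles, sanitize=False)
    Chem.SanitizeMol(mol, Chem.SANITIZE_ALL ^ Chem.SANITIZE_SETAROMATICITY)
    rw = Chem.RWMol(mol)
    # Record the bond types first, then replace the bonds in place.
    bond_types = [(b.GetIdx(), b.GetBondType()) for b in mol.GetBonds()]
    for idx, btype in bond_types:
        if btype == Chem.BondType.SINGLE:
            rw.ReplaceBond(idx, SINGLE_OR_AROMATIC)
        elif btype == Chem.BondType.DOUBLE:
            rw.ReplaceBond(idx, DOUBLE_OR_AROMATIC)
    return rw.GetMol()


query = relax_bond_aromaticity("CC(=O)OC1=CC=CC=C1C(=O)O")  # Kekule aspirin

aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
# Stand-in for an analogue whose carboxyl group is embedded in a pyranone ring.
embedded = Chem.MolFromSmiles("CC(=O)Oc1cccc2ccoc(=O)c12")

print(aspirin.HasSubstructMatch(query))
print(embedded.HasSubstructMatch(query))
print(embedded.GetSubstructMatch(query))  # atom indices usable for highlighting
```

Relaxing every bond is deliberately permissive: it also lets a Kekule cyclohexadiene ring match a benzene ring, so a production version would probably restrict which bonds are generalized.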

-greg



On Wed, Sep 13, 2017 at 4:42 PM, Michał Nowotka  wrote:

> Is there any flag in RDkit to match both 'normal' aspirin and embedded
> aromatic analogues?
> The problem is that I can't modify user queries by hand in real time :)
>
> On Wed, Sep 13, 2017 at 2:12 PM, Chris Earnshaw 
> wrote:
> > Hi
> >
> > The problem is due to RDkit perceiving the embedded pyranone in
> > CHEMBL1999443 as an aromatic system, which is probably correct. However,
> in
> > the structure of aspirin the carboxyl carbon and singly bonded oxygen are
> > non-aromatic, so if you just use the SMILES of aspirin as a query it
> won't
> > match CHEMBL1999443
> >
> > You'll need to use a slightly more generic aspirin-like query to allow
> the
> > possibility of matching both 'normal' aspirin and embedded aromatic
> > analogues. CC(=O)Oc1ccccc1[#6](=O)[#8] should work OK.
> >
> > Regards,
> > Chris
> >
> > On 13 September 2017 at 13:40, Michał Nowotka  wrote:
> >>
> >> Hi,
> >>
> >> This problem is probably due to my lack of chemistry knowledge but
> >> please have a look:
> >>
> >> If I do a substructure search in ChEMBL using aspirin (CHEMBL25) as a
> >> query (ChEMBL API uses the Symyx cartridge):
> >>
> >> from chembl_webresource_client.new_client import new_client
> >> res = new_client.substructure.filter(chembl_id='CHEMBL25')
> >>
> >> One of them will be CHEMBL1999443:
> >>
> >> 'CHEMBL1999443' in (r['molecule_chembl_id'] for r in res)
> >> >>> True
> >>
> >> Now I take the molfile:
> >>
> >> new_client.molecule.set_format('mol')
> >> mol = new_client.molecule.get('CHEMBL1999443')
> >>
> >> and load it with aspirin into rdkit:
> >>
> >> from rdkit import Chem
> >> m = Chem.MolFromMolBlock(mol)
> >> pattern = Chem.MolFromMolBlock(new_client.molecule.get('CHEMBL25'))
> >>
> >> If I check if it has an aspirin as a substructure using rdkit, I'm
> >> getting false...
> >>
> >> m.HasSubstructMatch(pattern)
> >> >>> False
> >>
> >> Looking at this blog post:
> >>
> >> https://github.com/rdkit/rdkit-tutorials/blob/master/
> notebooks/002_SMARTS_SubstructureMatching.ipynb
> >> I tried to initialize rings and retry:
> >>
> >>  Chem.GetSymmSSSR(m)
> >>  m.HasSubstructMatch(pattern)
> >>  >>>False
> >>
> >> Chem.GetSymmSSSR(pattern)
> >> m.HasSubstructMatch(pattern)
> >> >>>False
> >>
> >> But as you can see without any luck. Is there anything else I can do
> >> to get the match anyway?
> >> Without having a match I can't align and highlight the aspirin substructure
> >> in the CHEMBL1999443 image using the GenerateDepictionMatching2DStructure and
> >> DrawMolecule functions.
> >>
> >> Kind regards,
> >>
> >> Michał Nowotka
> >>
> >>

Re: [Rdkit-discuss] Fwd: Re: HasSubstructMatch doesn't work as expected

2017-09-13 Thread Peter S. Shenkin
It can,  but you have to tell it how. It can't read your mind. Give it a
SMILES and either an atom list or a SMARTS that specifies what you want.

-P.
Sent from a cell phone. Please forgive brvty and m1St@kes.

On Sep 13, 2017 4:42 PM, "Michał Nowotka"  wrote:

> True, but I'm only getting molfiles instead.
> My very naive assumption was that if I'm able to highlight the
> structure manually (printing out the resulting structure images and
> highlighting the query substructure with a pen) then rdkit should be able
> to do the same thing.
>
> On Wed, Sep 13, 2017 at 9:36 PM, Peter S. Shenkin 
> wrote:
> > I neglected to cc Rdkit on this earlier. If he can get the matching atom
> > list from their other program, he won't have to mess w. SMARTS matching
> in
> > Rdkit.
> >
> > -P.
> > Sent from a cell phone. Please forgive brvty and m1St@kes.
> > -- Forwarded message --
> > From: "Peter S. Shenkin" 
> > Date: Sep 13, 2017 3:15 PM
> > Subject: Re: [Rdkit-discuss] HasSubstructMatch doesn't work as expected
> > To: "Michał Nowotka" 
> > Cc:
> >
> > Well, depending on how the substructure results from the other program
> are
> > presented, you might not have to deal with SMARTS matching at all
> yourself.
> > For example, if you have a SMILES for the structure and a list of atom
> > indices into that SMILES that constitute the matching substructure (where
> > the first atom in the SMILES has index 0), you can do the following:
> >
> > from rdkit import Chem
> > from rdkit.Chem import Draw
> >
> > smi = 'Oc1ccccc1' # Assume a SMILES
> > matching_atoms = [0, 1] # Assume a list of matching atoms
> > mol = Chem.MolFromSmiles(smi)
> > x = Draw.MolToImage(mol,highlightAtoms=matching_atoms)
> > display(x)
> >
> >
> > See attached for the image, from a Jupyter notebook.
> >
> > If, on the other hand, you have to work from SMARTS, then it seems to me
> > that you need to understand something about how SMARTS works, and you
> have
> > to understand the needed chemical concepts, or at least interact with
> > someone who does.
> >
> > Otherwise, it's a bit like trying to do complicated substring matches
> using
> > regular expressions, without knowing how regular expressions work.
> >
> > -
> > P.
> >
> >
> > On Sep 13, 2017 12:12 PM, "Michał Nowotka"  wrote:
> >>
> >> OK, so what I have is some substructure results from another (non-rdkit)
> >> cartridge and I want to use rdkit to generate images of all results
> >> with the query substructure highlighted and aligned.
> >> So I have two things: a list of compounds and a query compound.
> >> Now I need to highlight the query compound for every compound from the
> >> list and I need to do it at all costs. I can't leave any compound not
> >> highlighted even if rdkit by default has a different opinion on whether
> >> the query compound really is a true substructure of a given compound.
> >>
> >> So how can I instruct rdkit to ignore aromaticity and other factors,
> >> preferably one by one, each time going one level deeper, where the last
> >> resort would be simply matching on the level of two planar graphs? Is
> >> that possible?
> >>
> >> On Wed, Sep 13, 2017 at 4:48 PM, Peter S. Shenkin 
> >> wrote:
> >> > Your course of action depends upon just what you are really trying to
> >> > do. If
> >> > it's only aspirin, then why wouldn't you just do it manually? If it
> goes
> >> > beyond aspirin, you have to start by defining in general terms exactly
> >> > what
> >> > you want to match to what.
> >> >
> >> > For example, given a query molecule (aspirin in this case), if you
> want
> >> > all
> >> > its non-aromatic atoms to match aromatic as well as non-aromatic atoms
> >> > in
> >> > the database, you could write a string-alteration routine to munge the
> >> > SMILES of a query molecule into a SMARTS that would do just that, and
> >> > then
> >> > use that SMARTS to match your database molecules. Repeat for each
> query
> >> > molecule.
> >> >
> >> > But you have to start with a precise definition of just what kind of
> >> > matching you wish to do. For instance, maybe you don't really want
> >> > non-aromatic ring atoms in your query to match aromatic rings and vice
> >> > versa
> >> > (i.e., a cyclohexyl to match a phenyl); maybe you only want non-ring
> >> > atoms
> >> > in the query to match aliphatic as well as aromatic substructures. And
> >> > so
> >> > on.
> >> >
> >> > -P.
> >> >
> >> >
> >> > On Wed, Sep 13, 2017 at 10:42 AM, Michał Nowotka 
> >> > wrote:
> >> >>
> >> >> Is there any flag in RDkit to match both 'normal' aspirin and
> embedded
> >> >> aromatic analogues?
> >> >> The problem is that I can't modify user queries by hand in real time
> :)
> >> >>
> >> >> On Wed, Sep 13, 2017 at 2:12 PM, Chris Earnshaw <
> cgearns...@gmail.com>
> >> >> wrote:
> >> >> > Hi
> >> >> >
> >> >> > The problem is due to RDkit perceiving the embedded pyranone in
> >> >> > CHEMBL1999443 as an aromatic system, which is probably correct.
> >> >> > However,
> >> >> > in
> >> >> > the structur

[Rdkit-discuss] Fwd: Re: HasSubstructMatch doesn't work as expected

2017-09-13 Thread Peter S. Shenkin
I neglected to cc Rdkit on this earlier. If he can get the matching atom
list from their other program, he won't have to mess w. SMARTS matching in
Rdkit.

-P.
Sent from a cell phone. Please forgive brvty and m1St@kes.
-- Forwarded message --
From: "Peter S. Shenkin" 
Date: Sep 13, 2017 3:15 PM
Subject: Re: [Rdkit-discuss] HasSubstructMatch doesn't work as expected
To: "Michał Nowotka" 
Cc:

Well, depending on how the substructure results from the other program are
presented, you might not have to deal with SMARTS matching at all yourself.
For example, if you have a SMILES for the structure and a list of atom
indices into that SMILES that constitute the matching substructure (where
the first atom in the SMILES has index 0), you can do the following:

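from rdkit import Chem
from rdkit.Chem import Draw
from IPython.display import display  # built in to Jupyter; explicit for clarity

smi = 'Oc1ccccc1' # Assume a SMILES
matching_atoms = [0, 1] # Assume a list of matching atoms
mol = Chem.MolFromSmiles(smi)
# Highlight the atoms named in the list rather than hard-coding them
x = Draw.MolToImage(mol, highlightAtoms=matching_atoms)
display(x)


See attached for the image, from a Jupyter notebook.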

If, on the other hand, you have to work from SMARTS, then it seems to me
that you need to understand something about how SMARTS works, and you have
to understand the needed chemical concepts, or at least interact with
someone who does.

Otherwise, it's a bit like trying to do complicated substring matches using
regular expressions, without knowing how regular expressions work.
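
For the atom-list route described above, a notebook is not required. The sketch below renders the highlighted molecule to SVG text instead; the helper name `highlighted_svg` is hypothetical, and phenol with atoms 0 and 1 is assumed purely as example input.

```python
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D


def highlighted_svg(mol, atom_indices, size=(300, 300)):
    """Render mol as SVG text with the given atoms highlighted.

    Hypothetical helper, for illustration only.
    """
    drawer = rdMolDraw2D.MolDraw2DSVG(size[0], size[1])
    # PrepareAndDrawMolecule computes 2D coordinates if they are missing.
    rdMolDraw2D.PrepareAndDrawMolecule(
        drawer, mol, highlightAtoms=list(atom_indices)
    )
    drawer.FinishDrawing()
    return drawer.GetDrawingText()


mol = Chem.MolFromSmiles("Oc1ccccc1")  # example structure (phenol)
matching_atoms = [0, 1]                # atom list supplied by the other program
svg = highlighted_svg(mol, matching_atoms)
print(len(svg), "characters of SVG")
```

The same function accepts the tuple returned by `GetSubstructMatch`, so it works equally well when the atom list comes from an RDKit match.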

-P.


On Sep 13, 2017 12:12 PM, "Michał Nowotka"  wrote:

> OK, so what I have is some substructure results from another (non-rdkit)
> cartridge and I want to use rdkit to generate images of all results
> with the query substructure highlighted and aligned.
> So I have two things: a list of compounds and a query compound.
> Now I need to highlight the query compound for every compound from the
> list and I need to do it at all costs. I can't leave any compound not
> highlighted even if rdkit by default has a different opinion on whether
> the query compound really is a true substructure of a given compound.
>
> So how can I instruct rdkit to ignore aromaticity and other factors,
> preferably one by one, each time going one level deeper, where the last
> resort would be simply matching on the level of two planar graphs? Is
> that possible?
>
> On Wed, Sep 13, 2017 at 4:48 PM, Peter S. Shenkin 
> wrote:
> > Your course of action depends upon just what you are really trying to
> do. If
> > it's only aspirin, then why wouldn't you just do it manually? If it goes
> > beyond aspirin, you have to start by defining in general terms exactly
> what
> > you want to match to what.
> >
> > For example, given a query molecule (aspirin in this case), if you want
> all
> > its non-aromatic atoms to match aromatic as well as non-aromatic atoms in
> > the database, you could write a string-alteration routine to munge the
> > SMILES of a query molecule into a SMARTS that would do just that, and
> then
> > use that SMARTS to match your database molecules. Repeat for each query
> > molecule.
> >
> > But you have to start with a precise definition of just what kind of
> > matching you wish to do. For instance, maybe you don't really want
> > non-aromatic ring atoms in your query to match aromatic rings and vice
> versa
> > (i.e., a cyclohexyl to match a phenyl); maybe you only want non-ring
> atoms
> > in the query to match aliphatic as well as aromatic substructures. And so
> > on.
> >
> > -P.
> >
> >
> > On Wed, Sep 13, 2017 at 10:42 AM, Michał Nowotka 
> wrote:
> >>
> >> Is there any flag in RDkit to match both 'normal' aspirin and embedded
> >> aromatic analogues?
> >> The problem is that I can't modify user queries by hand in real time :)
> >>
> >> On Wed, Sep 13, 2017 at 2:12 PM, Chris Earnshaw 
> >> wrote:
> >> > Hi
> >> >
> >> > The problem is due to RDkit perceiving the embedded pyranone in
> >> > CHEMBL1999443 as an aromatic system, which is probably correct.
> However,
> >> > in
> >> > the structure of aspirin the carboxyl carbon and singly bonded oxygen
> >> > are
> >> > non-aromatic, so if you just use the SMILES of aspirin as a query it
> >> > won't
> >> > match CHEMBL1999443
> >> >
> >> > You'll need to use a slightly more generic aspirin-like query to allow
> >> > the
> >> > possibility of matching both 'normal' aspirin and embedded aromatic
> >> > analogues. CC(=O)Oc1ccccc1[#6](=O)[#8] should work OK.
> >> >
> >> > Regards,
> >> > Chris
> >> >
> >> > On 13 September 2017 at 13:40, Michał Nowotka 
> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> This problem is probably due to my lack of chemistry knowledge but
> >> >> please have a look:
> >> >>
> >> >> If I do a substructure search in ChEMBL using aspirin (CHEMBL25) as a
> >> >> query (ChEMBL API uses the Symyx cartridge):
> >> >>
> >> >> from chembl_webresource_client.new_client import new_client
> >> >> res = new_client.substructure.filter(chembl_id='CHEMBL25')
> >> >>
> >> >> One of them will be CHEMBL1999443:
> >> >>
> >> >> 'CHEMBL19

Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Malitha Kabir
Hi Wandré,

1) apt-get installs the 2013 release of RDKit (link below), so please install
it through conda instead, as Markus suggested.
https://packages.ubuntu.com/trusty/python/python-rdkit

2) I am not familiar with cases of incorrect SMILES generation, but the
paper linked below covers more that I think you need to know.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3495655/

3) Since you are storing data, it is worth deciding whether you are storing
energy-minimized molecules or not (my opinion). Surface-area-related
descriptors will give different results in the two cases, while descriptors
based on bond connectivity will give the same result.

4) Sharing my personal experience: during my undergraduate studies, part of
my final-year project got bogged down in conceptual questions. I failed
to take advantage of more advanced developments for lack of time, and
the experience that followed was not so good.

Please keep in mind that we can generate a non-redundant database for a few
molecules easily, but for millions of molecules it will be quite a tough task.
Have a great day!

- malitha




On Thu, Sep 14, 2017 at 2:05 AM, Markus Sitzmann 
wrote:

> PS. The conda version has InChI support
>
> On Wed, Sep 13, 2017 at 10:04 PM, Markus Sitzmann <
> markus.sitzm...@gmail.com> wrote:
>
>> Strong recommendation: use the conda version:
>>
>> http://www.rdkit.org/docs/Install.html
>>
>> On Wed, Sep 13, 2017 at 9:58 PM, Wandré  wrote:
>>
>>> I just run sudo apt-get install python-rdkit librdkit1 rdkit-data 😁
>>> I'm trying to solve this with this link:
>>> http://www.blopig.com/blog/2013/02/how-to-install-rdkit-on-ubuntu-12-04/
>>>
>>> --
>>> Wandré Nunes de Pinho Veloso
>>> Professor Assistente - Unifei - Campus Avançado de Itabira-MG
>>> Doutorando em Bioinformática - Universidade Federal de Minas Gerais -
>>> UFMG
>>> Pesquisador do INSILICO - Grupo Interdisciplinar em Simulação e
>>> Inteligência Computacional - UNIFEI
>>> Membro do Grupo de Pesquisa Assinaturas Biológicas da FIOCRUZ
>>> Membro do Grupo de Pesquisa Bioinformática Estrutural da UFMG
>>> Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG
>>>
>>> 2017-09-13 16:55 GMT-03:00 Markus Sitzmann :
>>>
>>>> How did you install rdkit so far? And where? Is it the conda/anaconda
>>>> version?
>>>>
>>>> On Wed, Sep 13, 2017 at 9:39 PM, Wandré  wrote:
>>>>
>>>>> How to install RDKit with InChI?
>>>>> When I run Chem.inchi.INCHI_AVAILABLE, the result is False
>>>>>
>>>>> 2017-09-13 16:30 GMT-03:00 Wandré :
>>>>>
>>>>>> Thanks Malitha.
>>>>>> I chose these descriptors because I will store them in my database,
>>>>>> so it will be fast to compare a molecule before inserting it.
>>>>>> My worry now is whether RDKit will generate different SMILES or InChI
>>>>>> for the same SDF molecule, or identical ones for different molecules
>>>>>> (molecules from RCSB PDB, PubChem, ChEMBL, for example).
>>>>>>
>>>>>> 2017-09-13 16:22 GMT-03:00 Malitha Kabir :
>>>>>>
>>>>>>> Hi Wandré,
>>>>>>>
>>>>>>> It seems you have already done intense research on this. Kindly accept
>>>>>>> my comments as an addition to your idea (not the answer you are trying
>>>>>>> to find). In my view, categorizing molecules by their descriptors should
>>>>>>> reduce computation time. RDKit currently offers calculation of about 200
>>>>>>> descriptors! So a careful look at those makes a lot of sense to me.
>>>>>>> Conceptually, descriptor matching should follow a sequence (I don't know
>>>>>>> what sequence would be ideal) - for example MolWt should match first (H
>>>>>>> contribution and ions should be taken into consideration here) and then
>>>>>>> subsequent matching of other descriptors (might be different while
>>>>>>> writing programs). There is some reading material on molecular
>>>>>>> fingerprints and database schemas. You may want to have a look at it.
>>>>>>>
>>>>>>> The links are from Daylight. I am neither involved with the company
>>>>>>> nor their product.
>>>>>>> http://ww

Re: [Rdkit-discuss] Non-redundant database of molecules (Wandr?)

2017-09-13 Thread Markus Sitzmann
Unless you deliberately do something else, SMILES *calculated* by RDKit from
any input are canonical per se (BUT that is only true if you compare them to
other SMILES also calculated by RDKit; you cannot compare SMILES between
software packages, even if they are canonical within the domain of each
package).
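
As a minimal sketch of what this means for de-duplication, the snippet below keys each input on the SMILES that RDKit itself writes out, so different input spellings of one structure collapse to a single entry. The `deduplicate` helper and the four example inputs are assumptions made for illustration.

```python
from rdkit import Chem


def deduplicate(smiles_list):
    """Map each RDKit canonical SMILES to the first input that produced it.

    Hypothetical helper, for illustration only.
    """
    unique = {}
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip input that RDKit cannot parse
        # The output SMILES is canonical within RDKit, so it works as a key.
        key = Chem.MolToSmiles(mol, isomericSmiles=True)
        unique.setdefault(key, smi)
    return unique


inputs = ["CCO", "OCC", "c1ccccc1", "C1=CC=CC=C1"]
unique = deduplicate(inputs)
print(len(unique), "unique structures from", len(inputs), "inputs")
```

In a relational database the same idea becomes a uniqueness constraint on a text column holding the canonical SMILES, so no pairwise fingerprint comparison is needed to detect exact duplicates.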

On Wed, Sep 13, 2017 at 9:16 PM, Wandré  wrote:

> Why not use the InChI function in RDKit?
> Canonical SMILES cannot be generated by RDKit, correct?
>
> 2017-09-13 15:57 GMT-03:00 Chris Swain :
>
>> Hi,
>>
>> I’d use a text-based representation of the structure (InChIKey or canonical
>> SMILES); it then becomes an easy task to do the comparison in Python.
>>
>> I wrote a script to do this in Vortex but it should be easy to modify.
>> https://www.macinchem.org/reviews/vortex/tut28/scripting_vortex28.php
>>
>>
>> Cheers
>>
>> Chris
>>
>>
>>
>> Today's Topics:
>>
>>   1. Non-redundant database of molecules (Wandr?)
>>
>>
>> --
>>
>> Message: 1
>> Date: Wed, 13 Sep 2017 07:13:56 -0300
>> From: Wandré
>> To: rdkit-discuss@lists.sourceforge.net
>> Subject: [Rdkit-discuss] Non-redundant database of molecules
>> Message-ID:
>> 
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hi,
>>
>> My name is Wandré and I'm from Brazil.
>> I'm trying to build a large database of molecules, and I want to eliminate
>> all redundant molecules before inserting them into the database.
>> What is the best way to identify a molecule in RDKit? Is it the SMILES
>> ("Chem.MolToSmiles(mol,isomericSmiles=True)"), or will I need to compare
>> all molecules, one by one, before inserting them (using Tanimoto)?
>> That would be hard because my database will hold many millions of
>> molecules. Is one-by-one comparison before each insert the only answer?
>> Checking whether a SMILES has already been inserted is easy (a text
>> comparison), but comparing molecular fingerprints...
>>
>> If I really do need to compare fingerprints, how can I store this data in
>> PostgreSQL without using the cartridge? I would generate the fingerprint
>> (Atompair, for example), store it in the database, and compare against
>> all stored fingerprints, one by one, before inserting a new molecule. This
>> fingerprint (Atompair) has a lot of features, so storing it in a
>> relational database is expensive.
>> Is it possible?
>>
>> Thanks!
>>

Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Markus Sitzmann
PS. The conda version has InChI support
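
A small sketch for checking this from Python and degrading gracefully when InChI support is missing. The helper name `structure_key` is hypothetical; `INCHI_AVAILABLE`, `MolToInchi` and `InchiToInchiKey` come from `rdkit.Chem.inchi`.

```python
from rdkit import Chem
from rdkit.Chem import inchi


def structure_key(mol):
    """Return an InChIKey if InChI support is present, else canonical SMILES.

    Hypothetical helper, for illustration only.
    """
    if inchi.INCHI_AVAILABLE:
        return inchi.InchiToInchiKey(inchi.MolToInchi(mol))
    # Fallback for builds without InChI, such as the one reported in this thread.
    return Chem.MolToSmiles(mol)


print("InChI available:", inchi.INCHI_AVAILABLE)
print(structure_key(Chem.MolFromSmiles("CCO")))
```

Keys produced by the two branches are not comparable with each other, so a database should be populated using one branch only.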

On Wed, Sep 13, 2017 at 10:04 PM, Markus Sitzmann  wrote:

> Strong recommendation: use the conda version:
>
> http://www.rdkit.org/docs/Install.html
>
> On Wed, Sep 13, 2017 at 9:58 PM, Wandré  wrote:
>
>> I just run sudo apt-get install python-rdkit librdkit1 rdkit-data 😁
>> I'm trying to solve this with this link:
>> http://www.blopig.com/blog/2013/02/how-to-install-rdkit-on-ubuntu-12-04/
>>
>> 2017-09-13 16:55 GMT-03:00 Markus Sitzmann :
>>
>>> How did you install rdkit so far? And where? Is it the conda/anaconda
>>> version?
>>>
>>> On Wed, Sep 13, 2017 at 9:39 PM, Wandré  wrote:
>>>
>>>> How to install RDKit with InChI?
>>>> When I run Chem.inchi.INCHI_AVAILABLE, the result is False
>>>>
>>>> 2017-09-13 16:30 GMT-03:00 Wandré :
>>>>
>>>>> Thanks Malitha.
>>>>> I chose these descriptors because I will store them in my database,
>>>>> so it will be fast to compare a molecule before inserting it.
>>>>> My worry now is whether RDKit will generate different SMILES or InChI
>>>>> for the same SDF molecule, or identical ones for different molecules
>>>>> (molecules from RCSB PDB, PubChem, ChEMBL, for example).
>>>>>
>>>>> 2017-09-13 16:22 GMT-03:00 Malitha Kabir :
>>>>>
>>>>>> Hi Wandré,
>>>>>>
>>>>>> It seems you have already done intense research on this. Kindly accept
>>>>>> my comments as an addition to your idea (not the answer you are trying
>>>>>> to find). In my view, categorizing molecules by their descriptors should
>>>>>> reduce computation time. RDKit currently offers calculation of about 200
>>>>>> descriptors! So a careful look at those makes a lot of sense to me.
>>>>>> Conceptually, descriptor matching should follow a sequence (I don't know
>>>>>> what sequence would be ideal) - for example MolWt should match first (H
>>>>>> contribution and ions should be taken into consideration here) and then
>>>>>> subsequent matching of other descriptors (might be different while
>>>>>> writing programs). There is some reading material on molecular
>>>>>> fingerprints and database schemas. You may want to have a look at it.
>>>>>>
>>>>>> The links are from Daylight. I am neither involved with the company
>>>>>> nor their product.
>>>>>> http://www.daylight.com/dayhtml/doc/theory/theory.finger.html
>>>>>> http://www.daylight.com/dayhtml/doc/theory/theory.thor.html
>>>>>>
>>>>>> Best regards,
>>>>>> - malitha
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 14, 2017 at 12:43 AM, Wandré
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for all the answers.
>>>>>>>
>>>>>>> Reading all the answers, I thought of something different... If the
>>>>>>> SMILES (Chem.MolToSmiles(mol,isomericSmiles=True)) and InChI
>>>>>>> (Chem.MolToInchi(mol)) can generate the same value for different
>>>>>>> molecules, I will generate other descriptors (NumHDonors, NumHAcceptors,
>>>>>>> RingCount, GetNumAtoms, TPSA, pyLabuteASA, MolWt, CalcNumRotatableBonds
>>>>>>> and MolLogP) to compare all the molecules whose SMILES and InChI are the
>>>>>>> same.
>>>>>>> If all of these data are the same, I will generate the fingerprint
>>>>>>> (Atompair for example) and use the Tanimoto coefficient; if this value,
>>>>>>> when I compare two molecules, is 1, the molecules are the same.
>>>>>>>
>>>>>>> Where is my mistake (I think there are one or more mistakes)?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>

Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Markus Sitzmann
Strong recommendation: use the conda version:

http://www.rdkit.org/docs/Install.html

On Wed, Sep 13, 2017 at 9:58 PM, Wandré  wrote:

> I just run sudo apt-get install python-rdkit librdkit1 rdkit-data 😁
> I'm trying to solve this with this link:
> http://www.blopig.com/blog/2013/02/how-to-install-rdkit-on-ubuntu-12-04/
>
> 2017-09-13 16:55 GMT-03:00 Markus Sitzmann :
>
>> How did you install rdkit so far? And where? Is it the conda/anaconda
>> version?
>>
>> On Wed, Sep 13, 2017 at 9:39 PM, Wandré  wrote:
>>
>>> How to install RDKit with InChI?
>>> When I run Chem.inchi.INCHI_AVAILABLE, the result is False
>>>
>>> 2017-09-13 16:30 GMT-03:00 Wandré :
>>>
>>>> Thanks Malitha.
>>>> I chose these descriptors because I will store them in my database,
>>>> so it will be fast to compare a molecule before inserting it.
>>>> My worry now is whether RDKit will generate different SMILES or InChI
>>>> for the same SDF molecule, or identical ones for different molecules
>>>> (molecules from RCSB PDB, PubChem, ChEMBL, for example).

>>>> 2017-09-13 16:22 GMT-03:00 Malitha Kabir :
>>>>
>>>>> Hi Wandré,
>>>>>
>>>>> It seems you have already done intense research on this. Kindly accept
>>>>> my comments as an addition to your idea (not the answer you are trying
>>>>> to find). In my view, categorizing molecules by their descriptors should
>>>>> reduce computation time. RDKit currently offers calculation of about 200
>>>>> descriptors! So a careful look at those makes a lot of sense to me.
>>>>> Conceptually, descriptor matching should follow a sequence (I don't know
>>>>> what sequence would be ideal) - for example MolWt should match first (H
>>>>> contribution and ions should be taken into consideration here) and then
>>>>> subsequent matching of other descriptors (might be different while
>>>>> writing programs). There is some reading material on molecular
>>>>> fingerprints and database schemas. You may want to have a look at it.
>>>>>
>>>>> The links are from Daylight. I am neither involved with the company
>>>>> nor their product.
>>>>> http://www.daylight.com/dayhtml/doc/theory/theory.finger.html
>>>>> http://www.daylight.com/dayhtml/doc/theory/theory.thor.html
>>>>>
>>>>> Best regards,
>>>>> - malitha
>>>>>
>>>>>
>>>>> On Thu, Sep 14, 2017 at 12:43 AM, Wandré
>>>>> wrote:
>>>>>
>>>>>> Thanks for all the answers.
>>>>>>
>>>>>> Reading all the answers, I thought of something different... If the
>>>>>> SMILES (Chem.MolToSmiles(mol,isomericSmiles=True)) and InChI
>>>>>> (Chem.MolToInchi(mol)) can generate the same value for different
>>>>>> molecules, I will generate other descriptors (NumHDonors, NumHAcceptors,
>>>>>> RingCount, GetNumAtoms, TPSA, pyLabuteASA, MolWt, CalcNumRotatableBonds
>>>>>> and MolLogP) to compare all the molecules whose SMILES and InChI are the
>>>>>> same.
>>>>>> If all of these data are the same, I will generate the fingerprint
>>>>>> (Atompair for example) and use the Tanimoto coefficient; if this value,
>>>>>> when I compare two molecules, is 1, the molecules are the same.
>>>>>>
>>>>>> Where is my mistake (I think there are one or more mistakes)?
>>>>>>
>>>>>> Thanks!
>>>>>>
>> --
>> Wandré Nunes de Pinho Veloso
>> Professor Assistente - Unifei - Campus Avançado de Itabira-MG
>> Doutorando em Bioinformática - Universidade Federal de Minas Gerais -
>> UFMG
>> Pesquisador do INSILICO - Grupo Interdisciplinar em Simulação e
>> Inteligência Computacional - UNIFEI
>> Membro do Grupo de Pesquisa Assinaturas Biológicas da FIOCRUZ
>

Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Wandré
I just ran sudo apt-get install python-rdkit librdkit1 rdkit-data 😁
I'm now trying to solve this by following this link:
http://www.blopig.com/blog/2013/02/how-to-install-rdkit-on-ubuntu-12-04/

--
Wandré Nunes de Pinho Veloso
Assistant Professor - Unifei - Campus Avançado de Itabira-MG
PhD candidate in Bioinformatics - Universidade Federal de Minas Gerais - UFMG
Researcher at INSILICO - Grupo Interdisciplinar em Simulação e
Inteligência Computacional - UNIFEI
Member of the research group Assinaturas Biológicas, FIOCRUZ
Member of the research group Bioinformática Estrutural, UFMG
Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG

2017-09-13 16:55 GMT-03:00 Markus Sitzmann :

> How did you install rdkit so far? And where? Is it the conda/anaconda
> version?
>

Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Markus Sitzmann
How did you install rdkit so far? And where? Is it the conda/anaconda
version?

On Wed, Sep 13, 2017 at 9:39 PM, Wandré  wrote:

> How to install RDKit with InChI?
> When I run Chem.inchi.INCHI_AVAILABLE, the result is False

Re: [Rdkit-discuss] HasSubstructMatch doesn't work as expected

2017-09-13 Thread Michal Krompiec
I'm afraid it won't work in the general case (i.e. you can make it work for
some classes of compounds, but not without unwanted side effects on others)
if the aromaticity model of the other cartridge is different - and it seems
to be the case here...
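
For the limited case that can be made to work (letting non-aromatic query
atoms also match aromatic ones, i.e. the SMILES-to-SMARTS munging suggested
further down in the quoted thread), a minimal sketch could look like the
following. It is plain string handling and does not call RDKit. The function
name loosen_aliphatic_atoms and the element table are hypothetical, and only
the organic-subset symbols listed are handled; bracket atoms and aromatic
(lowercase) atoms are left untouched.

```python
import re

# Atomic numbers for the SMILES organic subset (the only symbols handled here).
ATOMIC_NUMBERS = {
    "B": 5, "C": 6, "N": 7, "O": 8, "F": 9,
    "P": 15, "S": 16, "Cl": 17, "Br": 35, "I": 53,
}

# Bracket atoms are matched first so their contents are never rewritten;
# two-letter halogens come before the one-letter symbols.
_TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOFPSI]")


def loosen_aliphatic_atoms(smiles):
    """Rewrite aliphatic organic-subset atoms as [#n] SMARTS primitives.

    [#6] matches carbon whether it is aromatic or not, whereas a bare
    'C' in SMARTS matches aliphatic carbon only.
    """
    def replace(match):
        token = match.group(0)
        if token.startswith("["):
            return token  # leave explicit bracket atoms alone
        return "[#%d]" % ATOMIC_NUMBERS[token]

    return _TOKEN.sub(replace, smiles)


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(loosen_aliphatic_atoms(aspirin))
# [#6][#6](=[#8])[#8]c1ccccc1[#6](=[#8])[#8]
```

The resulting string can then be parsed as SMARTS (for example with
Chem.MolFromSmarts) and used as the pattern. Note that this loosens every
aliphatic atom, so a cyclohexyl in the query would also match a phenyl, which
may or may not be what is wanted.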

On Wednesday, 13 September 2017, Michał Nowotka  wrote:

> OK, so what I have is some substructure results from another (non-RDKit)
> cartridge and I want to use RDKit to generate images of all results
> with the query substructure highlighted and aligned.
> So I have two things: a list of compounds and a query compound.
> Now I need to highlight the query compound in every compound from the
> list and I need to do it at all costs. I can't leave any compound not
> highlighted even if RDKit by default has a different opinion on whether
> the query compound really is a true substructure of a given compound.
>
> So how can I instruct RDKit to ignore aromaticity and other factors,
> preferably one by one, each time going one level deeper, where the last
> resort would be simply matching at the level of two planar graphs? Is
> that possible?
>
> On Wed, Sep 13, 2017 at 4:48 PM, Peter S. Shenkin wrote:
> > Your course of action depends upon just what you are really trying to do.
> > If it's only aspirin, then why wouldn't you just do it manually? If it goes
> > beyond aspirin, you have to start by defining in general terms exactly what
> > you want to match to what.
> >
> > For example, given a query molecule (aspirin in this case), if you want all
> > its non-aromatic atoms to match aromatic as well as non-aromatic atoms in
> > the database, you could write a string-alteration routine to munge the
> > SMILES of a query molecule into a SMARTS that would do just that, and then
> > use that SMARTS to match your database molecules. Repeat for each query
> > molecule.
> >
> > But you have to start with a precise definition of just what kind of
> > matching you wish to do. For instance, maybe you don't really want
> > non-aromatic ring atoms in your query to match aromatic rings and vice versa
> > (i.e., a cyclohexyl to match a phenyl); maybe you only want non-ring atoms
> > in the query to match aliphatic as well as aromatic substructures. And so
> > on.
> >
> > -P.
> >
> >
> > On Wed, Sep 13, 2017 at 10:42 AM, Michał Nowotka wrote:
> >>
> >> Is there any flag in RDKit to match both 'normal' aspirin and embedded
> >> aromatic analogues?
> >> The problem is that I can't modify user queries by hand in real time :)
> >>
> >> On Wed, Sep 13, 2017 at 2:12 PM, Chris Earnshaw wrote:
> >> > Hi
> >> >
> >> > The problem is due to RDKit perceiving the embedded pyranone in
> >> > CHEMBL1999443 as an aromatic system, which is probably correct. However,
> >> > in the structure of aspirin the carboxyl carbon and singly bonded oxygen
> >> > are non-aromatic, so if you just use the SMILES of aspirin as a query it
> >> > won't match CHEMBL1999443.
> >> >
> >> > You'll need to use a slightly more generic aspirin-like query to allow
> >> > the possibility of matching both 'normal' aspirin and embedded aromatic
> >> > analogues. CC(=O)Oc1ccccc1[#6](=O)[#8] should work OK.
> >> >
> >> > Regards,
> >> > Chris
> >> >
> >> > On 13 September 2017 at 13:40, Michał Nowotka wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> This problem is probably due to my lack of chemistry knowledge but
> >> >> please have a look:
> >> >>
> >> >> If I do a substructure search in ChEMBL using aspirin (CHEMBL25) as a
> >> >> query (the ChEMBL API uses the Symyx cartridge):
> >> >>
> >> >> from chembl_webresource_client.new_client import new_client
> >> >> res = new_client.substructure.filter(chembl_id='CHEMBL25')
> >> >>
> >> >> One of the results will be CHEMBL1999443:
> >> >>
> >> >> 'CHEMBL1999443' in (r['molecule_chembl_id'] for r in res)
> >> >> >>> True
> >> >>
> >> >> Now I take the molfile:
> >> >>
> >> >> new_client.molecule.set_format('mol')
> >> >> mol = new_client.molecule.get('CHEMBL1999443')
> >> >>
> >> >> and load it, together with aspirin, into RDKit:
> >> >>
> >> >> from rdkit import Chem
> >> >> m = Chem.MolFromMolBlock(mol)
> >> >> pattern = Chem.MolFromMolBlock(new_client.molecule.get('CHEMBL25'))
> >> >>
> >> >> If I check with RDKit whether it has aspirin as a substructure, I'm
> >> >> getting False...
> >> >>
> >> >> m.HasSubstructMatch(pattern)
> >> >> >>> False
> >> >>
> >> >> Looking at this blog post:
> >> >>
> >> >> https://github.com/rdkit/rdkit-tutorials/blob/master/notebooks/002_SMARTS_SubstructureMatching.ipynb
> >> >>
> >> >> I tried to initialize rings and retry:
> >> >>
> >> >> Chem.GetSymmSSSR(m)
> >> >> m.HasSubstructMatch(pattern)
> >> >> >>> False
> >> >>
> >> >> Chem.GetSymmSSSR(pattern)
> >> >> m.HasSubstructMatch(pattern)
> >> >> >>> False
> >> >>
> >> >> But, as you can see, without any luck. Is there anything else I can do
> >> >> to get the match anyway?
> >> >> Without having a

Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Markus Sitzmann
Hi Wandré,

your problem is the opposite. It is quite unlikely, practically impossible,
that different molecules produce the same InChI or SMILES. Your bigger
problem is that what you regard as the same chemical may be regarded as
different ones by SMILES or InChI. The danger of this is quite big for
SMILES. It gets better with canonical SMILES (but, in my opinion, not by
much); your best friend is InChI or Standard InChI.

Also, if two different molecules did produce the same InChI or SMILES,
in all likelihood all your descriptors would be very similar, too, because
SMILES, InChI etc. are just connection table representations and the
descriptor algorithms work only on the connection table (so the
molecules also look the same to any of these algorithms).

Calculating a Tanimoto-type coefficient doesn't help with this problem either,
and a Tanimoto coefficient of 1 doesn't mean two molecules are identical
(they are very similar but not necessarily identical).
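
A toy illustration of that last point, as a sketch only. It does not use
RDKit: the "fingerprint" here is a hypothetical set of bonded element pairs,
chosen because it makes the collision obvious. Real fingerprints (atom pairs
and so on) are far richer, but they are still lossy summaries of the
structure, so a coefficient of 1 says the summaries agree, not that the
molecules do.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def toy_fingerprint(bonds):
    """Toy fingerprint: the set of element pairs that are bonded.

    Counts and topology are thrown away on purpose to make collisions obvious.
    """
    return {tuple(sorted(pair)) for pair in bonds}


# Heavy-atom bonds only; both alkanes reduce to the single feature ('C', 'C').
propane = [("C", "C"), ("C", "C")]
butane = [("C", "C"), ("C", "C"), ("C", "C")]
ethanol = [("C", "C"), ("C", "O")]

print(tanimoto(toy_fingerprint(propane), toy_fingerprint(butane)))   # 1.0
print(tanimoto(toy_fingerprint(propane), toy_fingerprint(ethanol)))  # 0.5
```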

Markus

On Wed, Sep 13, 2017 at 8:43 PM, Wandré  wrote:

> Thanks for all the answers.
>
> Reading all the answers, I thought of something different... If the SMILES
> (Chem.MolToSmiles(mol,isomericSmiles=True)) and InChI
> (Chem.MolToInchi(mol)) can come out the same for different molecules,
> I will generate other descriptors (NumHDonors, NumHAcceptors,
> RingCount, GetNumAtoms, TPSA, pyLabuteASA, MolWt, CalcNumRotatableBonds
> and MolLogP) to compare all the molecules whose SMILES and InChI are the
> same.
> If all of these values are the same, I will generate a fingerprint (atom
> pair, for example) and use the Tanimoto coefficient; if this value, when I
> compare two molecules, is 1, the molecules are the same.
>
> Where is my mistake (I think there are one or more mistakes)?
>
> Thanks!
>
> 2017-09-13 14:19 GMT-03:00 Dimitri Maziuk :
>
>> On 09/13/2017 11:46 AM, Markus Sitzmann wrote:
>> > The case that you have 3D information available for a molecule dataset
>> > is rare, if you want it trustworthy it gets even worse than that. And what
>> > is the point then to generate the configuration of a molecule first if you
>> > can not trust that either?
>>
>> Veering further off topic, do you even care in the first place? E.g. if
>> your molecule always exists as a mixture of isomers, except in some
>> megabuck-per-microgram painstakingly created reference samples, a
>> 3D-based system will represent it as two distinct molecules. Whereas you
>> want it represented as one.
>>
>> Last I looked PDB Ligand Expo had two different benzenes. Their software
>> doesn't (didn't?) do the circle version so they don't have the third one.
>>
>> --
>> Dimitri Maziuk
>> Programmer/sysadmin
>> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
>>
>>
>> 
>> --
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>


Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Wandré
How do I install RDKit with InChI support?
When I run Chem.inchi.INCHI_AVAILABLE, the result is False.

--
Wandré Nunes de Pinho Veloso
Assistant Professor - Unifei - Campus Avançado de Itabira-MG
PhD candidate in Bioinformatics - Universidade Federal de Minas Gerais - UFMG
Researcher at INSILICO - Grupo Interdisciplinar em Simulação e
Inteligência Computacional - UNIFEI
Member of the research group Assinaturas Biológicas, FIOCRUZ
Member of the research group Bioinformática Estrutural, UFMG
Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG


Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Wandré
Thanks Malitha.
I chose these descriptors because I will store them in my database, so it
will be fast to compare a molecule before inserting it into the database.
My worry now is whether RDKit will generate different SMILES or InChI for
the same molecule in different SDF files, or equal ones for different
molecules (molecules from RCSB PDB, PubChem, ChEMBL, for example).

--
Wandré Nunes de Pinho Veloso
Assistant Professor - Unifei - Campus Avançado de Itabira-MG
PhD candidate in Bioinformatics - Universidade Federal de Minas Gerais - UFMG
Researcher at INSILICO - Grupo Interdisciplinar em Simulação e
Inteligência Computacional - UNIFEI
Member of the research group Assinaturas Biológicas, FIOCRUZ
Member of the research group Bioinformática Estrutural, UFMG
Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG


Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Malitha Kabir
Hi Wandré,

It seems you have already done intensive research on this. Kindly accept my
comments as an addition to your idea (not the answer you are trying to find).
In my view, categorizing molecules by their descriptors should reduce
computation time. RDKit currently offers about 200 descriptors, so a careful
look at those makes a lot of sense to me.
Conceptually, descriptor matching should follow a sequence (I don't know
what sequence would be ideal): for example, MolWt should be matched first (H
contribution and ions should be taken into consideration here), followed by
the other descriptors (the order might differ when writing programs).
There is some reading material on molecular fingerprints and database
schemas. You may want to have a look at it.
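
The staged matching described above could be sketched like this. It is only
an illustration, assuming each molecule has already been reduced to a
dictionary of precomputed descriptor values. The ordering, the tolerance and
the helper first_mismatch are hypothetical, and the numbers are made up.

```python
import math

# Hypothetical ordering: cheap, discriminating descriptors first.
DESCRIPTOR_ORDER = ("MolWt", "NumHDonors", "NumHAcceptors", "RingCount")


def first_mismatch(a, b, order=DESCRIPTOR_ORDER, tol=1e-3):
    """Return the first descriptor name on which a and b differ, or None."""
    for name in order:
        if not math.isclose(a[name], b[name], rel_tol=0.0, abs_tol=tol):
            return name  # stop early, later descriptors are never compared
    return None


# Made-up descriptor values, for illustration only.
mol_a = {"MolWt": 180.16, "NumHDonors": 1, "NumHAcceptors": 3, "RingCount": 1}
mol_b = {"MolWt": 180.16, "NumHDonors": 1, "NumHAcceptors": 4, "RingCount": 1}
mol_c = {"MolWt": 151.16, "NumHDonors": 2, "NumHAcceptors": 2, "RingCount": 1}

print(first_mismatch(mol_a, mol_a))  # None
print(first_mismatch(mol_a, mol_b))  # NumHAcceptors
print(first_mismatch(mol_a, mol_c))  # MolWt
```

A result of None only means the pair was not ruled out by these descriptors;
it still needs a proper identity check.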

The links are from Daylight. I am neither involved with the company nor
their product.
http://www.daylight.com/dayhtml/doc/theory/theory.finger.html
http://www.daylight.com/dayhtml/doc/theory/theory.thor.html

Best regards,
- malitha


On Thu, Sep 14, 2017 at 12:43 AM, Wandré  wrote:

> Thanks for all the answers.
>
> Reading all the answers, I thought of something different... If the SMILES
> (Chem.MolToSmiles(mol,isomericSmiles=True)) and InChI
> (Chem.MolToInchi(mol)) can produce the same value for different molecules,
> I will generate other descriptors (NumHDonors, NumHAcceptors,
> RingCount, GetNumAtoms, TPSA, pyLabuteASA, MolWt, CalcNumRotatableBonds
> and MolLogP) to compare all the molecules whose SMILES and InChI are the
> same.
> If all these values are the same, I will generate a fingerprint (atom pair,
> for example) and compute the Tanimoto coefficient; if that value is 1 when
> I compare two molecules, the molecules are the same.
>
> Where is my mistake (I think there are one or more mistakes)?
>
> Thanks!
>
> --
> Wandré Nunes de Pinho Veloso
> Professor Assistente - Unifei - Campus Avançado de Itabira-MG
> Doutorando em Bioinformática - Universidade Federal de Minas Gerais - UFMG
> Pesquisador do INSILICO - Grupo Interdisciplinar em Simulação e
> Inteligência Computacional - UNIFEI
> Membro do Grupo de Pesquisa Assinaturas Biológicas da FIOCRUZ
> Membro do Grupo de Pesquisa Bioinformática Estrutural da UFMG
> Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG
>


Re: [Rdkit-discuss] Non-redundant database of molecules (Wandré)

2017-09-13 Thread Wandré
Why not use the InChI function in RDKit?
Canonical SMILES cannot be generated by RDKit, correct?

--
Wandré Nunes de Pinho Veloso
Assistant Professor - Unifei - Campus Avançado de Itabira-MG
PhD candidate in Bioinformatics - Universidade Federal de Minas Gerais - UFMG
Researcher at INSILICO - Grupo Interdisciplinar em Simulação e
Inteligência Computacional - UNIFEI
Member of the Assinaturas Biológicas research group at FIOCRUZ
Member of the Bioinformática Estrutural research group at UFMG
Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG

2017-09-13 15:57 GMT-03:00 Chris Swain :

> Hi,
>
> I’d use a text-based version of the structure (InChIKey or canonical SMILES);
> it then becomes an easy task to do the comparison in Python.
>
> I wrote a script to do this in Vortex but it should be easy to modify.
> https://www.macinchem.org/reviews/vortex/tut28/scripting_vortex28.php
>
>
> Cheers
>
> Chris
>
>
>


Re: [Rdkit-discuss] Non-redundant database of molecules (Wandré)

2017-09-13 Thread Chris Swain
Hi,

I’d use a text-based version of the structure (InChIKey or canonical SMILES);
it then becomes an easy task to do the comparison in Python.

I wrote a script to do this in Vortex but it should be easy to modify.
https://www.macinchem.org/reviews/vortex/tut28/scripting_vortex28.php 
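
A minimal sketch of that text-key comparison in Python follows. The helper
names (rdkit_canonical_smiles, deduplicate) are made up for illustration, and
the RDKit import assumes RDKit is installed; it is done lazily so that the
grouping logic itself has no dependencies.

```python
from typing import Callable, Dict, Iterable, List, Optional


def rdkit_canonical_smiles(smiles: str) -> Optional[str]:
    """Return RDKit's canonical isomeric SMILES, or None if parsing fails.

    RDKit is imported lazily so the rest of the sketch runs without it.
    """
    from rdkit import Chem  # assumption: RDKit is installed

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=True)


def deduplicate(structures: Iterable[str],
                key_func: Callable[[str], Optional[str]]) -> Dict[str, List[str]]:
    """Group input structures by a text key; each key stands for one molecule."""
    groups: Dict[str, List[str]] = {}
    for structure in structures:
        key = key_func(structure)
        if key is None:  # unparsable input: skip it rather than guess
            continue
        groups.setdefault(key, []).append(structure)
    return groups
```

With RDKit available, deduplicate(smiles_list, rdkit_canonical_smiles) groups
the inputs that share a canonical SMILES; an InChIKey-based key function could
be swapped in the same way.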



Cheers

Chris
> 
> 


Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Wandré
Thanks for all the answers.

Reading all the answers, I thought of something different... If the SMILES
(Chem.MolToSmiles(mol,isomericSmiles=True)) and InChI
(Chem.MolToInchi(mol)) can produce the same value for different molecules,
I will generate other descriptors
(NumHDonors, NumHAcceptors, RingCount, GetNumAtoms, TPSA, pyLabuteASA,
MolWt, CalcNumRotatableBonds
and MolLogP) to compare all the molecules whose SMILES and InChI are the
same.
If all these values are the same, I will generate a fingerprint (atom pair,
for example) and compute the Tanimoto coefficient; if that value is 1 when
I compare two molecules, the molecules are the same.
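
For reference, the Tanimoto step of this plan can be sketched in a few lines.
This is an illustration only: the fingerprints are plain Python sets of
made-up on-bit indices, not RDKit fingerprint objects, and the function name
is hypothetical. Note that a value of 1.0 only says that the two fingerprints
are identical; since a fingerprint does not encode everything about a
structure, that is evidence for, not proof of, the molecules being the same.

```python
from typing import Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    # |A and B| / |A or B|, with the union size computed from the set sizes
    return shared / (len(fp_a) + len(fp_b) - shared)


# Made-up bit sets, purely for illustration.
partial_overlap = tanimoto({1, 2, 3}, {2, 3, 4})  # 2 shared bits out of 4 total
identical = tanimoto({1, 2}, {1, 2})
```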

Where is my mistake (I think there are one or more mistakes)?

Thanks!

--
Wandré Nunes de Pinho Veloso
Assistant Professor - Unifei - Campus Avançado de Itabira-MG
PhD candidate in Bioinformatics - Universidade Federal de Minas Gerais - UFMG
Researcher at INSILICO - Grupo Interdisciplinar em Simulação e
Inteligência Computacional - UNIFEI
Member of the Assinaturas Biológicas research group at FIOCRUZ
Member of the Bioinformática Estrutural research group at UFMG
Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG

2017-09-13 14:19 GMT-03:00 Dimitri Maziuk :

> On 09/13/2017 11:46 AM, Markus Sitzmann wrote:
> > The case that you have 3D information available for a molecule dataset
> is rare, if you want it trustworthy it gets even worse than that. And what
> is the point then to generate the configuration of a molecule first if you
> can not trust that either?
>
> Veering further off topic, do you even care in the first place? E.g. if
> your molecule always exists as a mixture of isomers, except in some
> megabuck-per-microgram painstakingly created reference samples, a
> 3D-based system will represent it as two distinct molecules. Whereas you
> want it represented as one.
>
> Last I looked PDB Ligand Expo had two different benzenes. Their software
> doesn't (didn't?) do the circle version so they don't have the third one.
>
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
>
>
> 


Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Dimitri Maziuk
On 09/13/2017 11:46 AM, Markus Sitzmann wrote:
> The case that you have 3D information available for a molecule dataset is 
> rare, if you want it trustworthy it gets even worse than that. And what is 
> the point then to generate the configuration of a molecule first if you can 
> not trust that either?

Veering further off topic, do you even care in the first place? E.g. if
your molecule always exists as a mixture of isomers, except in some
megabuck-per-microgram painstakingly created reference samples, a
3D-based system will represent it as two distinct molecules. Whereas you
want it represented as one.

Last I looked PDB Ligand Expo had two different benzenes. Their software
doesn't (didn't?) do the circle version so they don't have the third one.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Markus Sitzmann
It is rare to have 3D information available for a molecule dataset, and
rarer still for that information to be trustworthy. And what is the point
of generating the configuration of a molecule first if you cannot trust
that either?

-
|  Markus Sitzmann
|  markus.sitzm...@gmail.com

> On 13. Sep 2017, at 17:58, Dimitri Maziuk  wrote:
> 
>> On 2017-09-13 10:17, Markus Sitzmann wrote:
>> Canonical SMILES are only a very rough approximation for "unique molecule" 
>> as they usually don't work well for tautomeric forms of compound.
>> InChI or Standard InChI is much better although also not perfect.
> 
> ALATIS I linked to above does impose a stable consistent ordering for 
> everything including hydrogens. The downside is it's garbage in - garbage 
> out: you need to start with a 3D structure, otherwise it has an option to 
> addHs and gen3D but no guarantee it'll generate the one you want.
> 
> Dima
> 


Re: [Rdkit-discuss] HasSubstructMatch doesn't work as expected

2017-09-13 Thread Michał Nowotka
OK, so what I have is some substructure results from another (non-RDKit)
cartridge, and I want to use RDKit to generate images of all results
with the query substructure highlighted and aligned.
So I have two things: a list of compounds and a query compound.
Now I need to highlight the query compound in every compound from the
list, and I need to do it at all costs. I can't leave any compound
unhighlighted, even if RDKit by default has a different opinion on whether
the query compound really is a true substructure of a given compound.

So how can I instruct RDKit to ignore aromaticity and other factors,
preferably one by one, each time going one level deeper, where the last
resort would be simply matching on the level of two planar graphs? Is
that possible?

On Wed, Sep 13, 2017 at 4:48 PM, Peter S. Shenkin  wrote:
> Your course of action depends upon just what you are really trying to do. If
> it's only aspirin, then why wouldn't you just do it manually? If it goes
> beyond aspirin, you have to start by defining in general terms exactly what
> you want to match to what.
>
> For example, given a query molecule (aspirin in this case), if you want all
> its non-aromatic atoms to match aromatic as well as non-aromatic atoms in
> the database, you could write a string-alteration routine to munge the
> SMILES of a query molecule into a SMARTS that would do just that, and then
> use that SMARTS to match your database molecules. Repeat for each query
> molecule.
>
> But you have to start with a precise definition of just what kind of
> matching you wish to do. For instance, maybe you don't really want
> non-aromatic ring atoms in your query to match aromatic rings and vice versa
> (i.e., a cyclohexyl to match a phenyl); maybe you only want non-ring atoms
> in the query to match aliphatic as well as aromatic substructures. And so
> on.
>
> -P.
>
>

Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Dimitri Maziuk

On 2017-09-13 10:17, Markus Sitzmann wrote:
> Canonical SMILES are only a very rough approximation for "unique
> molecule" as they usually don't work well for tautomeric forms of compound.
>
> InChI or Standard InChI is much better although also not perfect.


ALATIS I linked to above does impose a stable consistent ordering for 
everything including hydrogens. The downside is it's garbage in - 
garbage out: you need to start with a 3D structure, otherwise it has an 
option to addHs and gen3D but no guarantee it'll generate the one you want.


Dima



Re: [Rdkit-discuss] HasSubstructMatch doesn't work as expected

2017-09-13 Thread Peter S. Shenkin
Your course of action depends upon just what you are really trying to do.
If it's only aspirin, then why wouldn't you just do it manually? If it goes
beyond aspirin, you have to start by defining in general terms exactly what
you want to match to what.

For example, given a query molecule (aspirin in this case), if you want all
its non-aromatic atoms to match aromatic as well as non-aromatic atoms in
the database, you could write a string-alteration routine to munge the
SMILES of a query molecule into a SMARTS that would do just that, and then
use that SMARTS to match your database molecules. Repeat for each query
molecule.
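
A rough sketch of such a string-alteration routine is below. It is only an
illustration under stated assumptions: the function name is made up, the input
is assumed to be a SMILES that already writes aromatic atoms in lower case,
and only upper-case "organic subset" atoms outside square brackets are
rewritten. Each of those becomes [#N] (atomic number only), which in SMARTS
matches both aliphatic and aromatic atoms of that element; lower-case aromatic
atoms and bracket atoms are left untouched, as are the bonds.

```python
import re

# Atomic numbers for the SMILES "organic subset" atoms written in upper case.
ATOMIC_NUMBERS = {
    "B": 5, "C": 6, "N": 7, "O": 8, "F": 9,
    "P": 15, "S": 16, "Cl": 17, "Br": 35, "I": 53,
}

# Bracket atoms are matched first so that their contents are left untouched;
# the two-letter halogens come before the single letters C and B.
TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOFPSI]")


def relax_aliphatic_atoms(smiles: str) -> str:
    """Rewrite aliphatic organic-subset atoms as [#N] so they match either form."""
    def replace(match):
        token = match.group(0)
        if token.startswith("["):
            return token  # bracket atom: keep as written
        return "[#%d]" % ATOMIC_NUMBERS[token]

    return TOKEN.sub(replace, smiles)


aspirin_query = relax_aliphatic_atoms("CC(=O)Oc1ccccc1C(=O)O")
```

For the aspirin SMILES this gives
[#6][#6](=[#8])[#8]c1ccccc1[#6](=[#8])[#8], which could then be passed to
Chem.MolFromSmarts. Whether that is the right degree of relaxation depends on
the matching definition discussed next.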

But you have to start with a precise definition of just what kind of
matching you wish to do. For instance, maybe you don't really want
non-aromatic ring atoms in your query to match aromatic rings and vice
versa (i.e., a cyclohexyl to match a phenyl); maybe you only want non-ring
atoms in the query to match aliphatic as well as aromatic substructures.
And so on.

-P.


On Wed, Sep 13, 2017 at 10:42 AM, Michał Nowotka  wrote:

> Is there any flag in RDkit to match both 'normal' aspirin and embedded
> aromatic analogues?
> The problem is that I can't modify user queries by hand in real time :)
>


Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Markus Sitzmann
Canonical SMILES are only a very rough approximation for "unique molecule"
as they usually don't work well for tautomeric forms of a compound.
InChI or Standard InChI is much better although also not perfect.

The "perfect solution" depends also on how uniqueness or redundancy of
molecules is regarded for the purpose of the database.


On Wed, Sep 13, 2017 at 4:56 PM, TJ O'Donnell  wrote:

> Let the database do the work for you.  Create a canonical SMILES column
> and/or InChI column and declare them to be unique.  As you insert new
> rows, postgres will let  you know if there is already a row with the same
> SMILES or InChI.
> Here's some help on how to handle that.
> https://www.postgresql.org/docs/9.5/static/sql-insert.html#SQL-ON-CONFLICT
>
> TJ O'Donnell
>


Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Dimitri Maziuk

On 2017-09-13 09:56, TJ O'Donnell wrote:
> Let the database do the work for you.  Create a canonical SMILES column
> and/or InChI column and declare them to be unique.  As you insert new
> rows, postgres will let  you know if there is already a row with the same
> SMILES or InChI.
> Here's some help on how to handle that.
> https://www.postgresql.org/docs/9.5/static/sql-insert.html#SQL-ON-CONFLICT

One of the problems with this is that the insert normally fails on the first
conflict, whereas users very often want a list of all conflicts to look at
and see what's up. The page above mentions a special "excluded" table in
passing, but I don't see anything about accessing it or what it actually
contains.

If you don't care which molecules get dropped or why, "ON CONFLICT DO
NOTHING" should work very nicely.
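
One way to get the full list of conflicts rather than stopping at the first
one is to insert row by row with ON CONFLICT DO NOTHING and record the rows
that were not stored. The sketch below uses Python's built-in sqlite3 module
as a stand-in for PostgreSQL (an assumption made so the example is
self-contained; SQLite 3.24 or later accepts the same ON CONFLICT ... DO
NOTHING clause), and the table and function names are made up.

```python
import sqlite3


def insert_collecting_conflicts(conn, rows):
    """Insert (smiles, name) rows; return those rejected by the UNIQUE constraint."""
    conflicts = []
    for smiles, name in rows:
        cur = conn.execute(
            "INSERT INTO molecules (smiles, name) VALUES (?, ?) "
            "ON CONFLICT (smiles) DO NOTHING",
            (smiles, name),
        )
        if cur.rowcount == 0:  # nothing was inserted: the key already existed
            conflicts.append((smiles, name))
    return conflicts


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE molecules (smiles TEXT UNIQUE, name TEXT)")

rows = [("CCO", "ethanol"), ("c1ccccc1", "benzene"), ("CCO", "ethyl alcohol")]
conflicts = insert_collecting_conflicts(conn, rows)
stored = conn.execute("SELECT COUNT(*) FROM molecules").fetchone()[0]
```

Row-by-row inserts are slower than a bulk load, so for millions of molecules
this is a trade-off between speed and knowing exactly what was dropped.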


Dima



Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread TJ O'Donnell
Let the database do the work for you.  Create a canonical SMILES column
and/or InChI column and declare them to be unique.  As you insert new
rows, postgres will let  you know if there is already a row with the same
SMILES or InChI.
Here's some help on how to handle that.
https://www.postgresql.org/docs/9.5/static/sql-insert.html#SQL-ON-CONFLICT
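
As a small illustration of the unique-column idea, here is a sketch using
Python's built-in sqlite3 module in place of PostgreSQL (an assumption, so
that it runs without a server; the table, column and function names are made
up). A plain INSERT into a UNIQUE column raises an integrity error when the
key is already present, which is how the database "lets you know".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE molecules (id INTEGER PRIMARY KEY, smiles TEXT NOT NULL UNIQUE)"
)


def try_insert(conn, smiles):
    """Return True if the row was stored, False if the key was already present."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO molecules (smiles) VALUES (?)", (smiles,))
        return True
    except sqlite3.IntegrityError:
        return False


# The strings are assumed to be canonical SMILES generated beforehand.
results = [try_insert(conn, s) for s in ["CCO", "c1ccccc1", "CCO"]]
stored = conn.execute("SELECT COUNT(*) FROM molecules").fetchone()[0]
```

The uniqueness check is only as good as the key: two different text strings
for the same molecule would both be accepted, so the SMILES or InChI must be
generated consistently before insertion.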

TJ O'Donnell

On Wed, Sep 13, 2017 at 3:13 AM, Wandré  wrote:

> Hi,
>
> My name is Wandré and I'm from Brazil.
> I'm trying to build a big database of molecules, but I want to eliminate all
> the redundant molecules before inserting them into the database.
> I want to know what the best method is to identify a molecule in RDKit.
> Is it SMILES ("Chem.MolToSmiles(mol,isomericSmiles=True)"), or will I need to
> compare all molecules, one by one, before inserting them into the database
> (using Tanimoto)?
> This can be hard to do because my database will have many millions of
> molecules, so is comparing one by one before each insert the only answer?
> Checking whether a SMILES has already been inserted is easy (text
> comparison), but comparing molecular fingerprints...
>
> If I really need to compare molecular fingerprints, how do I store this
> data in PostgreSQL without using the cartridge? I would generate the
> fingerprint (atom pair, for example), store it in the database, and compare
> all the fingerprints, one by one, before inserting a new molecule. This
> fingerprint (atom pair) has a lot of features, so storing it in a relational
> database is expensive.
> Is it possible?
>
> Thanks!
>
> --
> Wandré Nunes de Pinho Veloso
> Professor Assistente - Unifei - Campus Avançado de Itabira-MG
> Doutorando em Bioinformática - Universidade Federal de Minas Gerais - UFMG
> Pesquisador do INSILICO - Grupo Interdisciplinar em Simulação e
> Inteligência Computacional - UNIFEI
> Membro do Grupo de Pesquisa Assinaturas Biológicas da FIOCRUZ
> Membro do Grupo de Pesquisa Bioinformática Estrutural da UFMG
> Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG
>
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] HasSubstructMatch doesn't work as expected

2017-09-13 Thread Michał Nowotka
Is there any flag in RDKit to match both 'normal' aspirin and embedded
aromatic analogues?
The problem is that I can't modify user queries by hand in real time :)

On Wed, Sep 13, 2017 at 2:12 PM, Chris Earnshaw  wrote:
> Hi
>
> The problem is due to RDKit perceiving the embedded pyranone in
> CHEMBL1999443 as an aromatic system, which is probably correct. However, in
> the structure of aspirin the carboxyl carbon and singly bonded oxygen are
> non-aromatic, so if you just use the SMILES of aspirin as a query it won't
> match CHEMBL1999443.
>
> You'll need to use a slightly more generic aspirin-like query to allow the
> possibility of matching both 'normal' aspirin and embedded aromatic
> analogues. CC(=O)Oc1ccccc1[#6](=O)[#8] should work OK.
>
> Regards,
> Chris
>
> On 13 September 2017 at 13:40, Michał Nowotka  wrote:
>>
>> Hi,
>>
>> This problem is probably due to my lack of chemistry knowledge, but
>> please have a look:
>>
>> If I do a substructure search in ChEMBL using aspirin (CHEMBL25) as a
>> query (the ChEMBL API uses the Symyx cartridge):
>>
>> from chembl_webresource_client.new_client import new_client
>> res = new_client.substructure.filter(chembl_id='CHEMBL25')
>>
>> One of them will be CHEMBL1999443:
>>
>> 'CHEMBL1999443' in (r['molecule_chembl_id'] for r in res)
>> >>> True
>>
>> Now I take the molfile:
>>
>> new_client.molecule.set_format('mol')
>> mol = new_client.molecule.get('CHEMBL1999443')
>>
>> and load it with aspirin into rdkit:
>>
>> from rdkit import Chem
>> m = Chem.MolFromMolBlock(mol)
>> pattern = Chem.MolFromMolBlock(new_client.molecule.get('CHEMBL25'))
>>
>> If I check if it has an aspirin as a substructure using rdkit, I'm
>> getting false...
>>
>> m.HasSubstructMatch(pattern)
>> >>> False
>>
>> Looking at this blog post:
>>
>> https://github.com/rdkit/rdkit-tutorials/blob/master/notebooks/002_SMARTS_SubstructureMatching.ipynb
>> I tried to initialize rings and retry:
>>
>>  Chem.GetSymmSSSR(m)
>>  m.HasSubstructMatch(pattern)
>>  >>>False
>>
>> Chem.GetSymmSSSR(pattern)
>> m.HasSubstructMatch(pattern)
>> >>>False
>>
>> But, as you can see, without any luck. Is there anything else I can do
>> to get the match anyway?
>> Without a match I can't align and highlight the aspirin substructure
>> in the CHEMBL1999443 image using the GenerateDepictionMatching2DStructure
>> and DrawMolecule functions.
>>
>> Kind regards,
>>
>> Michał Nowotka
>>


Re: [Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Dimitri Maziuk

On 2017-09-13 05:13, Wandré wrote:

> Checking whether a SMILES has already been inserted is easy (a text
> comparison), but comparing molecular fingerprints...


Here's one option: http://alatis.nmrfam.wisc.edu/ -- you can use string
comparison on the resulting InChI string.
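
A rough sketch of the same string-comparison idea. Note the substitution: it uses RDKit's own InChI support rather than the ALATIS service linked above, so the strings are standard InChI and not ALATIS output. The inchi_and_key helper is a hypothetical name, and the sketch assumes an RDKit build with InChI support.

```python
from rdkit import Chem


def inchi_and_key(smiles):
    """Return (InChI, InChIKey) for a SMILES string, using RDKit's InChI API."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("could not parse SMILES: %r" % smiles)
    inchi = Chem.MolToInchi(mol)
    # The InChIKey is a fixed-length hash of the InChI, convenient as a
    # database key when the full string is too long to index comfortably.
    return inchi, Chem.InchiToInchiKey(inchi)


# Two different ways of writing ethanol, and one different molecule.
ethanol_a = inchi_and_key("OCC")
ethanol_b = inchi_and_key("CCO")
ethylamine = inchi_and_key("CCN")

print(ethanol_a == ethanol_b)   # same molecule, same strings
print(ethanol_a == ethylamine)  # different molecule, different strings
```

Either string can then be stored in a column with a unique constraint, so that duplicate detection becomes a plain text comparison.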


Dima



Re: [Rdkit-discuss] HasSubstructMatch doesn't work as expected

2017-09-13 Thread Chris Earnshaw
Hi

The problem is due to RDKit perceiving the embedded pyranone in
CHEMBL1999443 as an aromatic system, which is probably correct. However, in
the structure of aspirin the carboxyl carbon and singly bonded oxygen are
non-aromatic, so if you just use the SMILES of aspirin as a query it won't
match CHEMBL1999443.

You'll need to use a slightly more generic aspirin-like query to allow the
possibility of matching both 'normal' aspirin and embedded aromatic
analogues. CC(=O)Oc1ccccc1[#6](=O)[#8] should work OK.
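
A small sketch of the difference between the two kinds of query. Two assumptions: the generic query is written with the full benzene ring, CC(=O)Oc1ccccc1[#6](=O)[#8], which is a reading of the intended pattern, and the "embedded" molecule is a hypothetical stand-in (an isocoumarin carrying an acetoxy group), not the actual structure of CHEMBL1999443.

```python
from rdkit import Chem

# Plain aspirin, read from SMILES and used directly as a query molecule.
aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Hypothetical stand-in for an "embedded aromatic analogue": the C(=O)O
# unit is part of a pyranone ring that RDKit perceives as aromatic.
embedded = Chem.MolFromSmiles("CC(=O)Oc1cccc2ccoc(=O)c12")

# Generic query: [#6] and [#8] match carbon and oxygen whether aromatic or
# not, and an unspecified SMARTS bond means "single or aromatic".
generic = Chem.MolFromSmarts("CC(=O)Oc1ccccc1[#6](=O)[#8]")

# Aspirin-as-molecule requires single bonds around the carboxyl carbon,
# which are aromatic bonds in the stand-in, so this does not match.
plain_hit = embedded.HasSubstructMatch(aspirin)

# The generic SMARTS matches both molecules.
generic_hits = (
    aspirin.HasSubstructMatch(generic),
    embedded.HasSubstructMatch(generic),
)

print(plain_hit)     # False
print(generic_hits)  # (True, True)
```

The atom indices returned by GetSubstructMatch(generic) could then be used for highlighting, since they identify the aspirin-like part in either molecule.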

Regards,
Chris

On 13 September 2017 at 13:40, Michał Nowotka  wrote:

> Hi,
>
> This problem is probably due to my lack of chemistry knowledge, but
> please have a look:
>
> If I do a substructure search in ChEMBL using aspirin (CHEMBL25) as a
> query (the ChEMBL API uses the Symyx cartridge):
>
> from chembl_webresource_client.new_client import new_client
> res = new_client.substructure.filter(chembl_id='CHEMBL25')
>
> One of them will be CHEMBL1999443:
>
> 'CHEMBL1999443' in (r['molecule_chembl_id'] for r in res)
> >>> True
>
> Now I take the molfile:
>
> new_client.molecule.set_format('mol')
> mol = new_client.molecule.get('CHEMBL1999443')
>
> and load it with aspirin into rdkit:
>
> from rdkit import Chem
> m = Chem.MolFromMolBlock(mol)
> pattern = Chem.MolFromMolBlock(new_client.molecule.get('CHEMBL25'))
>
> If I check if it has an aspirin as a substructure using rdkit, I'm
> getting false...
>
> m.HasSubstructMatch(pattern)
> >>> False
>
> Looking at this blog post:
> https://github.com/rdkit/rdkit-tutorials/blob/master/notebooks/002_SMARTS_SubstructureMatching.ipynb
> I tried to initialize rings and retry:
>
>  Chem.GetSymmSSSR(m)
>  m.HasSubstructMatch(pattern)
>  >>>False
>
> Chem.GetSymmSSSR(pattern)
> m.HasSubstructMatch(pattern)
> >>>False
>
> But, as you can see, without any luck. Is there anything else I can do
> to get the match anyway?
> Without a match I can't align and highlight the aspirin substructure
> in the CHEMBL1999443 image using the GenerateDepictionMatching2DStructure
> and DrawMolecule functions.
>
> Kind regards,
>
> Michał Nowotka
>


[Rdkit-discuss] HasSubstructMatch doesn't work as expected

2017-09-13 Thread Michał Nowotka
Hi,

This problem is probably due to my lack of chemistry knowledge, but
please have a look:

If I do a substructure search in ChEMBL using aspirin (CHEMBL25) as a
query (the ChEMBL API uses the Symyx cartridge):

from chembl_webresource_client.new_client import new_client
res = new_client.substructure.filter(chembl_id='CHEMBL25')

One of them will be CHEMBL1999443:

'CHEMBL1999443' in (r['molecule_chembl_id'] for r in res)
>>> True

Now I take the molfile:

new_client.molecule.set_format('mol')
mol = new_client.molecule.get('CHEMBL1999443')

and load it with aspirin into rdkit:

from rdkit import Chem
m = Chem.MolFromMolBlock(mol)
pattern = Chem.MolFromMolBlock(new_client.molecule.get('CHEMBL25'))

If I check with RDKit whether it contains aspirin as a substructure, I get
False:

m.HasSubstructMatch(pattern)
>>> False

Looking at this blog post:
https://github.com/rdkit/rdkit-tutorials/blob/master/notebooks/002_SMARTS_SubstructureMatching.ipynb
I tried to initialize rings and retry:

Chem.GetSymmSSSR(m)
m.HasSubstructMatch(pattern)
>>> False

Chem.GetSymmSSSR(pattern)
m.HasSubstructMatch(pattern)
>>> False

But, as you can see, without any luck. Is there anything else I can do
to get the match anyway?
Without a match I can't align and highlight the aspirin substructure
in the CHEMBL1999443 image using the GenerateDepictionMatching2DStructure
and DrawMolecule functions.

Kind regards,

Michał Nowotka



[Rdkit-discuss] Non-redundant database of molecules

2017-09-13 Thread Wandré
Hi,

My name is Wandré and I'm from Brazil.
I'm trying to build a large database of molecules, and I want to eliminate
all redundant molecules before inserting them into the database.
I'd like to know the best way to identify a molecule uniquely in RDKit.
Is it SMILES ("Chem.MolToSmiles(mol,isomericSmiles=True)"), or will I need
to compare all molecules, one by one, before inserting them (using
Tanimoto)?
That could be hard because my database will hold many millions of
molecules. Is comparing one by one before each insert the only answer?
Checking whether a SMILES has already been inserted is easy (a text
comparison), but comparing molecular fingerprints...
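
To illustrate the text-comparison idea, here is a minimal sketch. It assumes canonical isomeric SMILES is an acceptable identity key, which means tautomers and different charge states of the same compound are not merged. canonical_key is a hypothetical helper name.

```python
from rdkit import Chem


def canonical_key(smiles):
    """Return RDKit's canonical isomeric SMILES for an input SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("could not parse SMILES: %r" % smiles)
    return Chem.MolToSmiles(mol, isomericSmiles=True)


# Three spellings of ethanol and two of benzene.
inputs = ["OCC", "CCO", "C(O)C", "c1ccccc1", "C1=CC=CC=C1"]

# A set of canonical strings keeps one entry per distinct molecule,
# using only string equality (no pairwise Tanimoto comparison).
unique = {canonical_key(s) for s in inputs}
print(len(unique))  # 2
```

The canonical string depends on the toolkit and can change between toolkit versions, so all keys in one database should be generated with the same RDKit release.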

If I really do need to compare fingerprints, how can I store that data in
PostgreSQL without using the cartridge? I would generate the fingerprint
(Atompair, for example), store it in the database, and compare all the
fingerprints, one by one, before inserting a new molecule. This
fingerprint (Atompair) has a lot of features, so storing it in a
relational database is expensive.
Is that possible?

Thanks!

--
Wandré Nunes de Pinho Veloso
Assistant Professor - Unifei - Campus Avançado de Itabira-MG
PhD candidate in Bioinformatics - Universidade Federal de Minas Gerais - UFMG
Researcher at INSILICO - Interdisciplinary Group in Simulation and
Computational Intelligence - UNIFEI
Member of the Biological Signatures Research Group, FIOCRUZ
Member of the Structural Bioinformatics Research Group, UFMG
Bioinformatics and Systems Laboratory - LBS, DCC, UFMG