Re: [Rdkit-discuss] MCS module - bonding and hybridization in substructure search

Janusz Petkowski Mon, 16 Nov 2015 06:56:15 -0800

Dear Greg and Peter,

Thank you very much for your feedback, and I am very sorry if my examples were 
not clear enough. Please see the examples below, given in the format Greg 
requested. I hope they help explain what I mean.

Thanks a lot!

Best regards,

Janusz Petkowski

As an additional requirement, the ringMatchesRingOnly and completeRingsOnly 
options are applied in every case.

Example 1:

["CC=CNC", "C=CNC=CC"] ==> CC=CN

Example 2:

["CC(N)C(N)=O", "CC(N)C(=O)NC(C)C(=O)O"] ==> CC(N)C(N)=O
["CC(N)C(=O)O", "CC(N)C(=O)NC(C)C(=O)O"] ==> CC(N)C(=O)O

Example 3:

["C\C=C\N", "C\C=C\NC1CCC1"] ==> C/C=C/N
["CCCN", "CCCNC1CCC1"] ==> CCCN
["CCCN", "CCCNC1=CCC1"] ==> CCCN

Example 4:

["NC1CCC1", "C\C=C\NC1CCC1"] ==> NC1CCC1

Example 5:

["NC1=CCC1", "CCN=NC1=CCC1"] ==> C1CC=C1

Example 6:

["NC1=CCC1", "CC\C=N/C1=CCC1"] ==> C1CC=C1
["NC1=CCC1", "CC\C=N/C1CCC1"] ==> None

Example 7:

["CCC", "CC(C)=O"] ==> None
["CCC", "CC(C)O"] ==> CCC
["CCC", "CC(C)=N"] ==> None
["CCC", "CC(C)N"] ==> CCC
["CCC", "CCC=C=C"] ==> None
["C=C=C", "CCC=C=C"] ==> C=C=C

Example 8:

["NC1CCC1", "CN=C1CCC1"] ==> CCC (but with the ringMatchesRingOnly and 
completeRingsOnly options both on ==> None)
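
One possible reading of the examples above, offered as a sketch rather than a 
confirmed solution: atoms that carry only single bonds are compared by element 
alone, while atoms that carry a multiple bond are compared by element plus 
hybridization. The snippet below stores that label in the isotope field (the 
same trick as in the earlier Liz Wylie thread) and runs rdFMCS.FindMCS with 
CompareIsotopes. The helper names tag_atoms and is_exact_match and the 
"100 * code + atomic number" encoding are illustrative assumptions, not part 
of RDKit.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS


def tag_atoms(mol):
    """Return a copy of mol whose isotope labels encode the matching class."""
    tagged = Chem.Mol(mol)
    for atom in tagged.GetAtoms():
        only_single = all(
            bond.GetBondType() == Chem.BondType.SINGLE for bond in atom.GetBonds()
        )
        # Assumed encoding: hybridization counts only when the atom carries
        # a non-single bond; otherwise the element alone decides.
        code = 0 if only_single else int(atom.GetHybridization())
        atom.SetIsotope(100 * code + atom.GetAtomicNum())
    return tagged


def is_exact_match(query_smiles, target_smiles):
    """True if the MCS of the tagged molecules covers the whole query."""
    query = tag_atoms(Chem.MolFromSmiles(query_smiles))
    target = tag_atoms(Chem.MolFromSmiles(target_smiles))
    result = rdFMCS.FindMCS(
        [query, target],
        atomCompare=rdFMCS.AtomCompare.CompareIsotopes,
        bondCompare=rdFMCS.BondCompare.CompareOrder,
        ringMatchesRingOnly=True,
        completeRingsOnly=True,
    )
    return result.numAtoms == query.GetNumAtoms()


print(is_exact_match("NC1CCC1", "C/C=C/NC1CCC1"))   # Example 4
print(is_exact_match("NC1=CCC1", "CCN=NC1=CCC1"))   # Example 5
```

Under this reading the first call should report a full match and the second 
should not, in line with Examples 4 and 5. Whether the rule covers every case 
of interest would still need checking against real data.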

________________________________
From: Peter Shenkin [[email protected]]
Sent: Sunday, November 15, 2015 2:44 PM
To: Janusz Petkowski
Cc: Greg Landrum; [email protected]
Subject: Re: [Rdkit-discuss] MCS module - bonding and hybridization in 
substructure search

Say, Greg,

If you understand Janusz's request, could you perhaps explain it in other 
words? I don't quite follow it, despite having read the two emails.

I'm getting the sense that he wants to make sure that SP2 nitrogens match only 
SP2 nitrogens (for example). Is this right? I know OpenEye has an extension to 
specify hybridization, but don't know whether RDKit has implemented something 
like that; if not, a recursive SMARTS ought to be able to do it.

On Sun, Nov 15, 2015 at 10:55 AM, Janusz Petkowski 
<[email protected]> wrote:
Dear Greg,

Thank you very much for your reply. I will try to explain more what I would 
like to achieve, I hope that it will clarify things a little.

Let's look at your example first and treat the first molecule (CC=CNC) in 
["CC=CNC", "C=CNC=CC"] as a "query". We would like to check whether it is an 
EXACT match within the second molecule ("C=CNC=CC").

Your example is a case of the "solution to the Liz Wylie problem" at its best.

["CC=CNC", "C=CNC=CC"] ==> CC=CN - so 'no' - no exact match! This is what we 
would expect from the current "solution to the Liz Wylie problem", and it is 
what I would consider "CORRECT" for my purposes.

The columns of the tables below are:
bond_type, bond_start_atom, bond_start_atom_symbol, bond_start_atom_hyb, 
bond_end_atom, bond_end_atom_symbol, bond_end_atom_hyb

CC=CNC
SINGLE 0 C SP3 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 N SP2
SINGLE 3 N SP2 4 C SP3

C=CNC=CC

DOUBLE 0 C SP2 1 C SP2
SINGLE 1 C SP2 2 N SP2
SINGLE 2 N SP2 3 C SP2
DOUBLE 3 C SP2 4 C SP2
SINGLE 4 C SP2 5 C SP3
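
For readers who want to reproduce tables like the two above, a minimal sketch 
follows. It assumes a current RDKit install; the function name bond_table is 
made up for this illustration, and the column order simply follows the list 
given before the tables.

```python
from rdkit import Chem


def bond_table(smiles):
    """Return one row per bond: type, then index/symbol/hybridization of each end."""
    mol = Chem.MolFromSmiles(smiles)
    rows = []
    for bond in mol.GetBonds():
        begin, end = bond.GetBeginAtom(), bond.GetEndAtom()
        rows.append((
            str(bond.GetBondType()),
            begin.GetIdx(), begin.GetSymbol(), str(begin.GetHybridization()),
            end.GetIdx(), end.GetSymbol(), str(end.GetHybridization()),
        ))
    return rows


# Print the table for the "query" molecule, one bond per line.
for row in bond_table("CC=CNC"):
    print(*row)
```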

In your example the hybridizations of the C atoms in the CNC fragment of the 
two molecules do not match, and the overall result is OK. In the first "query" 
molecule the first C of the CNC fragment is sp2 (it is connected to the 
neighbouring C of the "query" molecule via the DOUBLE bond), the N is sp2, but 
the last C is sp3 and is bonded only via SINGLE bonds. In the second molecule 
(C=CNC=CC) both carbons of the CNC fragment are sp2 AND both carry a DOUBLE 
bond, unlike the "query" molecule, where one carries a DOUBLE bond and the 
other only SINGLE bonds.

What I would like to do is check whether one structure is an exact match 
within the other: the atoms must match, the bonds must match and the 
hybridization of each atom must match. The bonding is the most important 
criterion, though, and that is where the exceptions show up, because an sp2 
atom can be bonded via SINGLE bonds only. Let me illustrate what I mean with a 
couple of examples.

Examples to illustrate it:

Example 1, Ala-Ala dipeptide case:

CC(N)C(=O)NC(C)C(=O)O

SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP3
SINGLE 1 C SP3 3 C SP2
DOUBLE 3 C SP2 4 O SP2
SINGLE 3 C SP2 5 N SP2
SINGLE 5 N SP2 6 C SP3
SINGLE 6 C SP3 7 C SP3
SINGLE 6 C SP3 8 C SP2
SINGLE 8 C SP2 9 O SP2
DOUBLE 8 C SP2 10 O SP2

if I have two "query" molecules:

1) CC(N)C(N)=O
CC(N)C(N)=O
SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP3
SINGLE 1 C SP3 3 C SP2
SINGLE 3 C SP2 4 N SP2
DOUBLE 3 C SP2 5 O SP2

["CC(N)C(N)=O", "CC(N)C(=O)NC(C)C(=O)O"] ==> CC(N)C(N)=O - so 'yes' - the exact 
match! And "CORRECT!"
2) CC(N)C(O)=O
CC(N)C(=O)O
SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP3
SINGLE 1 C SP3 3 C SP2
SINGLE 3 C SP2 4 O SP2
DOUBLE 3 C SP2 5 O SP2
["CC(N)C(=O)O", "CC(N)C(=O)NC(C)C(=O)O"] ==> CC(N)C=O - so 'no' - no exact 
match! But it should be "CORRECT" because it is there.

I would like to check whether the query molecules are an EXACT match within 
the Ala-Ala dipeptide, CC(N)C(=O)NC(C)C(=O)O. With the current "solution to 
the Liz Wylie problem" only molecule 1) is found there; molecule 2) is not 
found in CC(N)C(=O)NC(C)C(=O)O because the hybridizations of the N atoms do 
not match. I very much need the "solution to the Liz Wylie problem" to prevent 
matching atoms with different hybridizations, but at the same time I would 
like to ensure that if an atom has sp2 hybridization yet is bonded only by 
single bonds, its hybridization state is less important and what really 
matters is its bonding.

Example 2:

C\C=C\NC1CCC1
CC=CNC1CCC1
SINGLE 0 C SP3 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 N SP2
SINGLE 3 N SP2 4 C SP3
SINGLE 4 C SP3 5 C SP3
SINGLE 5 C SP3 6 C SP3
SINGLE 6 C SP3 7 C SP3
SINGLE 7 C SP3 4 C SP3

Two "query" molecules:

1) C\C=C\N
CC=CN
SINGLE 0 C SP3 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 N SP2

["C\C=C\N", "C\C=C\NC1CCC1"] ==> C/C=C/N - so 'yes' - the exact match! And 
"CORRECT!"

This is an easy example - everything matches between the "query" and the 
molecule - the atoms, the bonding and the hybridization.

2) NC1CCC1
NC1CCC1
SINGLE 0 N SP3 1 C SP3
SINGLE 1 C SP3 2 C SP3
SINGLE 2 C SP3 3 C SP3
SINGLE 3 C SP3 4 C SP3
SINGLE 4 C SP3 1 C SP3
["NC1CCC1", "C\C=C\NC1CCC1"] ==> C1CCC1 - so 'no' - no exact match! But it 
should be "CORRECT"

What does not match is the hybridization of the N atom between the "query" and 
the C\C=C\NC1CCC1 molecule, and that is true. However, in both the "query" and 
the C\C=C\NC1CCC1 molecule the N atom bond types match: both N atoms are 
bonded with SINGLE bonds only. The bonding match, for me, is of higher 
importance than the hybridization match.

Example 3:

The last example is an illustration of a hierarchical importance of matching I 
need. It is an example when everything matches but the result is "INCORRECT".

CC\N=N\C1=CCC1
CCN=NC1=CCC1
SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP2
DOUBLE 2 N SP2 3 N SP2
SINGLE 3 N SP2 4 C SP2
DOUBLE 4 C SP2 5 C SP2
SINGLE 5 C SP2 6 C SP3
SINGLE 6 C SP3 7 C SP3
SINGLE 7 C SP3 4 C SP2

One "query" molecule:

1) NC1=CCC1

NC1=CCC1

SINGLE 0 N SP2 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 C SP3
SINGLE 3 C SP3 4 C SP3
SINGLE 4 C SP3 1 C SP2

["NC1=CCC1", "CCN=NC1=CCC1"] ==> NC1=CCC1 - so 'yes' - exact match! But it is 
"INCORRECT".

Why? Even though the N atoms in the "query" and in CCN=NC1=CCC1 are all sp2, 
both N atoms in the CCN=NC1=CCC1 molecule are DOUBLE bonded while the N atom 
in the "query" molecule is SINGLE bonded. The bonding therefore does not 
match, and as I mentioned earlier the bonding is more important than the 
hybridization.
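
The hierarchy described here can be written down as a small atom-comparison 
rule. The sketch below is plain Python and works directly on table rows in the 
format used in this message. It is only one interpretation of the rule: the 
function names atom_profiles and atoms_match are hypothetical, and "bonded only 
by SINGLE bonds" is taken to mean that hybridization is ignored for that atom.

```python
def atom_profiles(table):
    """Parse bond-table rows into {atom_idx: (symbol, hybridization, bond_types)}."""
    profiles = {}
    for row in table.strip().splitlines():
        btype, i, sym_i, hyb_i, j, sym_j, hyb_j = row.split()
        for idx, sym, hyb in ((int(i), sym_i, hyb_i), (int(j), sym_j, hyb_j)):
            profiles.setdefault(idx, (sym, hyb, set()))[2].add(btype)
    return profiles


def atoms_match(a, b):
    """Bonding first: single-only atoms ignore hybridization, others require it."""
    (sym_a, hyb_a, bonds_a), (sym_b, hyb_b, bonds_b) = a, b
    if sym_a != sym_b:
        return False
    single_a = bonds_a == {"SINGLE"}
    single_b = bonds_b == {"SINGLE"}
    if single_a or single_b:
        return single_a and single_b
    return hyb_a == hyb_b


# Tables for the "query" NC1=CCC1 and the molecule CCN=NC1=CCC1, as listed above.
QUERY = """
SINGLE 0 N SP2 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 C SP3
SINGLE 3 C SP3 4 C SP3
SINGLE 4 C SP3 1 C SP2
"""
TARGET = """
SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP2
DOUBLE 2 N SP2 3 N SP2
SINGLE 3 N SP2 4 C SP2
DOUBLE 4 C SP2 5 C SP2
SINGLE 5 C SP2 6 C SP3
SINGLE 6 C SP3 7 C SP3
SINGLE 7 C SP3 4 C SP2
"""

query, target = atom_profiles(QUERY), atom_profiles(TARGET)
print(atoms_match(query[0], target[3]))  # query N versus the ring-bound N
print(atoms_match(query[1], target[4]))  # ring sp2 C versus ring sp2 C
```

With this rule the two N atoms are rejected even though both are sp2, because 
only one of them carries a DOUBLE bond, while the ring carbons still match.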

I hope that this clarifies what I would like to achieve. I know that it is 
probably a highly non-standard and unique problem, but I would really 
appreciate your help with this matter! Of course the examples I gave are 
purely for computational purposes and do not reflect the chemical stability of 
those molecules.
Thanks a lot once again!
Have a great Sunday!
Janusz Petkowski

________________________________
From: Greg Landrum [[email protected]]
Sent: Saturday, November 14, 2015 11:26 PM
To: Janusz Petkowski
Cc: [email protected]
Subject: Re: [Rdkit-discuss] MCS module - bonding and hybridization in 
substructure search

Hi Janusz,

I'm not 100% sure what you're looking for, but I think it has something to do 
with including information about bond conjugation in the MCS procedure.

To confirm, can you please give a couple of examples of what you would like to 
have as output from the algorithm? Something like this with the input molecules 
on the left and the desired result on the right would help :
['CNC=CC', 'C=CNC=CC'] -> 'CNC=CC'
(I realize that specific example is not what you're looking for, it's just 
intended to be an example)

Once I've seen that I can try to figure out if it is currently doable and, if 
not, if it's possible to modify the code to support it.

Best,
-greg

On Fri, Nov 13, 2015 at 9:17 PM, Janusz Petkowski 
<[email protected]> wrote:
Dear RDKit Community,

I am looking for a way to use the MCS module in RDKit to compare the atoms and 
bonding of two molecules while also taking the hybridization of each atom into 
account.
A solution to a similar problem was suggested before (inspired by this 
RDKit-discuss thread started by Liz Wylie: 
http://www.mail-archive.com/[email protected]/msg03676.html 
and see here http://sourceforge.net/p/rdkit/mailman/message/31830412/ ),

but even though it is computationally correct it does not necessarily mirror 
some nuances of chemistry, and one may want to modify it in certain specific 
cases.
It works most of the time for cases like those proposed in the solution to the 
Liz Wylie case:

smis = ['CC(C)=C','CC(C)C']
 or

smis2 = ['CC(C)=C','CC(C)=N']
If we check whether the 'CCC' substructure is present in the molecules from 
those two data sets, Greg Landrum's solution finds CCC only in 'CC(C)C', 
taking into account the atoms, the bonding and the hybridization of the atoms. 
That is all correct and cool!

But let's look at another example:
Let's look for the N\CC\N substructure in 'C\C=C\NCCN\C=C\C', or for the 'NCN' 
substructure in NCN-C=C or 'C=CNCNC=C'. It will not be found there even 
though, "structurally speaking", it is there.
The problem is as follows: an electronegative atom next to a C=C bond will 
pull electron density from that bond, so the N-C bond in NCN-C=C will have a 
'bit of' double bond character even though technically it is a single bond. 
The current solution to the Liz Wylie problem does not ignore that and 
distinguishes between a regular N-C bond and an N-C bond next to a C=C bond 
(as in NCN-C=C), and because of that it will not find NCN in this structure. 
NCS in NCSC=C is matched because S is more electropositive than N or O and so 
its bond does not have that double-bond character. My question to the RDKit 
community is: how can Greg Landrum's solution to the Liz Wylie case be 
modified to match cases like those above while still retaining the 
hybridization check? (We do want the hybridization to match; we just want the 
bonding to be more important.) The problem is that the unmatched atoms, like 
the N atoms above, have sp2 hybridization but technically are bonded by single 
bonds on all sides.
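
To make the mismatch concrete, here is a small sketch, assuming a current 
RDKit install. A plain substructure search, which compares elements and bond 
orders only, does find NCN in C=CNCNC=C. Printing the hybridization RDKit 
assigns to each N shows where a hybridization-aware comparison would diverge. 
The variable names are illustrative only.

```python
from rdkit import Chem

query = Chem.MolFromSmiles("NCN")
target = Chem.MolFromSmiles("C=CNCNC=C")

# Plain substructure search: elements and bond orders only.
found = target.HasSubstructMatch(query)
print("NCN found by plain substructure search:", found)


def nitrogen_hybridizations(mol):
    """List the hybridization RDKit assigns to each nitrogen atom."""
    return [str(a.GetHybridization()) for a in mol.GetAtoms() if a.GetSymbol() == "N"]


print("query N atoms: ", nitrogen_hybridizations(query))
print("target N atoms:", nitrogen_hybridizations(target))
```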
Thanks a lot for your help, time and consideration. This is my first post on 
the RDKit forum, and I am new to RDKit and Python in general, so I apologize 
if anything is not clear.
I would really appreciate your help!

Best regards,

Janusz Petkowski

------------------------------------------------------------------------------

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
