Dear Greg and Peter,
Thank you very much for your feedback, and I am sorry if my examples were not
clear enough. Please see the examples below, provided in the format Greg
requested. I hope they help explain what I mean.
Thanks a lot!
Best regards,
Janusz Petkowski
As an additional requirement, the ringMatchesRingOnly and completeRingsOnly
options are applied in every case.
Example 1:
["CC=CNC", "C=CNC=CC"] ==> CC=CN
Example 2:
["CC(N)C(N)=O", "CC(N)C(=O)NC(C)C(=O)O"] ==> CC(N)C(N)=O
["CC(N)C(=O)O", "CC(N)C(=O)NC(C)C(=O)O"] ==> CC(N)C(=O)O
Example 3:
["C\C=C\N", "C\C=C\NC1CCC1"] ==> C/C=C/N
["CCCN", "CCCNC1CCC1"] ==> CCCN
["CCCN", "CCCNC1=CCC1"] ==> CCCN
Example 4:
["NC1CCC1", "C\C=C\NC1CCC1"] ==> NC1CCC1
Example 5:
["NC1=CCC1", "CCN=NC1=CCC1"] ==> C1CC=C1
Example 6:
["NC1=CCC1", "CC\C=N/C1=CCC1"] ==> C1CC=C1
["NC1=CCC1", "CC\C=N/C1CCC1"] ==> None
Example 7:
["CCC", "CC(C)=O"] ==> None
["CCC", "CC(C)O"] ==> CCC
["CCC", "CC(C)=N"] ==> None
["CCC", "CC(C)N"] ==> CCC
["CCC", "CCC=C=C"] ==> None
["C=C=C", "CCC=C=C"] ==> C=C=C
Example 8:
["NC1CCC1", "CN=C1CCC1"] ==> CCC (but with ringMatchesRingOnly and
completeRingsOnly both enabled ==> None)
________________________________
From: Peter Shenkin [[email protected]]
Sent: Sunday, November 15, 2015 2:44 PM
To: Janusz Petkowski
Cc: Greg Landrum; [email protected]
Subject: Re: [Rdkit-discuss] MCS module - bonding and hybridization in
substructure search
Say, Greg,
If you understand Janusz's request, could you perhaps explain it in other
words? I don't quite follow it, despite having read the two emails.
I'm getting the sense that he wants to make sure that SP2 nitrogens match only
SP2 nitrogens (for example). Is this right? I know OpenEye has an extension to
specify hybridization, but don't know whether RDKit has implemented something
like that; if not, a recursive SMARTS ought to be able to do it.
On Sun, Nov 15, 2015 at 10:55 AM, Janusz Petkowski
<[email protected]> wrote:
Dear Greg,
Thank you very much for your reply. I will try to explain in more detail what
I would like to achieve; I hope this clarifies things a little.
Let's look at your example first and treat the first molecule (CC=CNC) in
["CC=CNC", "C=CNC=CC"] as a "query". We would like to check whether it is an
EXACT match within the second molecule ("C=CNC=CC").
Your example is a case of the "solution to the Liz Wylie problem" at its best.
["CC=CNC", "C=CNC=CC"] ==> CC=CN - so 'no' - no exact match! This is what we
would expect from the current "solution to the Liz Wylie problem", and it is
what I would consider "CORRECT" for my purposes.
Each table below lists one bond per row, with the following columns:
>>> bond_type, bond_start_atom, bond_start_atom_symbol, bond_start_atom_hyb,
>>> bond_end_atom, bond_end_atom_symbol, bond_end_atom_hyb
CC=CNC
SINGLE 0 C SP3 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 N SP2
SINGLE 3 N SP2 4 C SP3
C=CNC=CC
DOUBLE 0 C SP2 1 C SP2
SINGLE 1 C SP2 2 N SP2
SINGLE 2 N SP2 3 C SP2
DOUBLE 3 C SP2 4 C SP2
SINGLE 4 C SP2 5 C SP3
In your example the hybridizations of the C atoms in the CNC fragment do not
match between the two molecules, so the overall result is fine. In the first
("query") molecule, the first C of the CNC fragment is sp2 (it is connected to
the preceding C by a DOUBLE bond) and the N is sp2, but the last C is sp3 and
has only SINGLE bonds. In the second molecule (C=CNC=CC), both carbons of the
CNC fragment are sp2 AND both take part in a DOUBLE bond, unlike in the
"query" molecule, where one carbon has a DOUBLE bond and the other only SINGLE
bonds.
What I would like to do is check whether one structure is an exact match
within the other: the atoms must match, the bonds must match, and the
hybridization of each atom must match. The bonding is the most important
criterion, though, and that is where the exceptions show up, because an sp2
atom can be bonded exclusively via SINGLE bonds. Let me illustrate what I mean
with a couple of examples.
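One possible reading of the rule above, written as a minimal pure-Python sketch
rather than anything RDKit itself provides: two atoms are treated as compatible
when their elements agree and the highest bond order each atom takes part in
agrees, and hybridization is deliberately not consulted. The helper names
(parse_rows, atom_summary, atoms_compatible) are made up for illustration, and
the rows are copied from the NC1CCC1 and C\C=C\NC1CCC1 bond tables given
further down in this message.

```python
# Sketch only: "bonding outranks hybridization" as an atom-comparison rule.
# All names here are hypothetical; nothing below is RDKit API.

ORDER = {"SINGLE": 1, "DOUBLE": 2, "TRIPLE": 3}

def parse_rows(text):
    """Parse rows like 'SINGLE 0 C SP3 1 C SP2' into tuples."""
    rows = []
    for line in text.strip().splitlines():
        btype, i, sym_i, hyb_i, j, sym_j, hyb_j = line.split()
        rows.append((btype, int(i), sym_i, hyb_i, int(j), sym_j, hyb_j))
    return rows

def atom_summary(rows):
    """Map atom index -> (symbol, hybridization, highest bond order)."""
    summary = {}
    for btype, i, sym_i, hyb_i, j, sym_j, hyb_j in rows:
        for idx, sym, hyb in ((i, sym_i, hyb_i), (j, sym_j, hyb_j)):
            _, _, best = summary.get(idx, (sym, hyb, 0))
            summary[idx] = (sym, hyb, max(best, ORDER[btype]))
    return summary

def atoms_compatible(a, b):
    """Same element and same highest bond order; hybridization is ignored."""
    return a[0] == b[0] and a[2] == b[2]

# NC1CCC1 (the "query")
query = atom_summary(parse_rows("""
SINGLE 0 N SP3 1 C SP3
SINGLE 1 C SP3 2 C SP3
SINGLE 2 C SP3 3 C SP3
SINGLE 3 C SP3 4 C SP3
SINGLE 4 C SP3 1 C SP3
"""))

# C\\C=C\\NC1CCC1 (the molecule being searched)
target = atom_summary(parse_rows("""
SINGLE 0 C SP3 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 N SP2
SINGLE 3 N SP2 4 C SP3
SINGLE 4 C SP3 5 C SP3
SINGLE 5 C SP3 6 C SP3
SINGLE 6 C SP3 7 C SP3
SINGLE 7 C SP3 4 C SP3
"""))

# N is SP3 in the query and SP2 in the target, but both carry only SINGLE bonds.
print(atoms_compatible(query[0], target[3]))

# An sp2 N with only SINGLE bonds versus an sp2 N that has a DOUBLE bond.
print(atoms_compatible(("N", "SP2", 1), ("N", "SP2", 2)))
```

Under this reading the first comparison is accepted despite the SP3/SP2
difference, and the second is rejected even though both nitrogens are SP2.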
Examples to illustrate it:
Example 1, Ala-Ala dipeptide case:
CC(N)C(=O)NC(C)C(=O)O
SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP3
SINGLE 1 C SP3 3 C SP2
DOUBLE 3 C SP2 4 O SP2
SINGLE 3 C SP2 5 N SP2
SINGLE 5 N SP2 6 C SP3
SINGLE 6 C SP3 7 C SP3
SINGLE 6 C SP3 8 C SP2
SINGLE 8 C SP2 9 O SP2
DOUBLE 8 C SP2 10 O SP2
if I have two "query" molecules:
1) CC(N)C(N)=O
CC(N)C(N)=O
SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP3
SINGLE 1 C SP3 3 C SP2
SINGLE 3 C SP2 4 N SP2
DOUBLE 3 C SP2 5 O SP2
["CC(N)C(N)=O", "CC(N)C(=O)NC(C)C(=O)O"] ==> CC(N)C(N)=O - so 'yes' - the exact
match! And "CORRECT!"
2) CC(N)C(O)=O
CC(N)C(=O)O
SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP3
SINGLE 1 C SP3 3 C SP2
SINGLE 3 C SP2 4 O SP2
DOUBLE 3 C SP2 5 O SP2
["CC(N)C(=O)O", "CC(N)C(=O)NC(C)C(=O)O"] ==> CC(N)C=O - so 'no' - no exact
match! But it should be "CORRECT" because it is there.
I would like to check whether the query molecules are an EXACT match within
the Ala-Ala dipeptide CC(N)C(=O)NC(C)C(=O)O. With the current "solution to the
Liz Wylie problem", only molecule 1) is found there; molecule 2) is not found
in CC(N)C(=O)NC(C)C(=O)O because the hybridizations of the N atom do not match.
I very much need the "solution to the Liz Wylie problem" to prevent matching
atoms with different hybridizations, but at the same time I would like to
ensure that if an atom happens to have sp2 hybridization while being bonded
only by single bonds, then its hybridization state is less important and what
really matters is its bonding.
Example 2:
C\C=C\NC1CCC1
CC=CNC1CCC1
SINGLE 0 C SP3 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 N SP2
SINGLE 3 N SP2 4 C SP3
SINGLE 4 C SP3 5 C SP3
SINGLE 5 C SP3 6 C SP3
SINGLE 6 C SP3 7 C SP3
SINGLE 7 C SP3 4 C SP3
Two "query" molecules:
1) C\C=C\N
CC=CN
SINGLE 0 C SP3 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 N SP2
["C\C=C\N", "C\C=C\NC1CCC1"] ==> C/C=C/N - so 'yes' - the exact match! And
"CORRECT!"
This is an easy example - everything matches between the "query" and the
molecule - the atoms, the bonding and the hybridization.
2) NC1CCC1
NC1CCC1
SINGLE 0 N SP3 1 C SP3
SINGLE 1 C SP3 2 C SP3
SINGLE 2 C SP3 3 C SP3
SINGLE 3 C SP3 4 C SP3
SINGLE 4 C SP3 1 C SP3
["NC1CCC1", "C\C=C\NC1CCC1"] ==> C1CCC1 - so 'no' - no exact match! But it
should be "CORRECT"
What does not match is the hybridization of the N atom between the "query" and
the C\C=C\NC1CCC1 molecule, and that is true. However, in both the "query" and
the C\C=C\NC1CCC1 molecule the N atom bond types match: both N atoms are bonded
only via SINGLE bonds. For me, the bonding match is of higher importance than
the hybridization match.
Example 3:
The last example illustrates the hierarchy of matching criteria I need. It is
a case where everything matches but the result is "INCORRECT".
CC\N=N\C1=CCC1
CCN=NC1=CCC1
SINGLE 0 C SP3 1 C SP3
SINGLE 1 C SP3 2 N SP2
DOUBLE 2 N SP2 3 N SP2
SINGLE 3 N SP2 4 C SP2
DOUBLE 4 C SP2 5 C SP2
SINGLE 5 C SP2 6 C SP3
SINGLE 6 C SP3 7 C SP3
SINGLE 7 C SP3 4 C SP2
One "query" molecule:
1) NC1=CCC1
NC1=CCC1
SINGLE 0 N SP2 1 C SP2
DOUBLE 1 C SP2 2 C SP2
SINGLE 2 C SP2 3 C SP3
SINGLE 3 C SP3 4 C SP3
SINGLE 4 C SP3 1 C SP2
["NC1=CCC1", "CCN=NC1=CCC1"] ==> NC1=CCC1 - so 'yes' - exact match! But it is
"INCORRECT".
Why? Even though the N atoms in the "query" and in CCN=NC1=CCC1 are all sp2,
both N atoms in the CCN=NC1=CCC1 molecule are DOUBLE bonded, whereas the N atom
in the "query" molecule is only SINGLE bonded. The bonding therefore does not
match and, as I mentioned earlier, the bonding has higher importance than the
hybridization.
I hope that this clarifies what I would like to achieve. I know that it is
probably a highly non-standard and unique problem, but I would really
appreciate your help with this matter! Of course, the examples I gave are
purely for computational purposes and do not reflect the chemical stability of
those molecules.
Thanks a lot once again!
Have a great Sunday!
Janusz Petkowski
________________________________
From: Greg Landrum [[email protected]]
Sent: Saturday, November 14, 2015 11:26 PM
To: Janusz Petkowski
Cc: [email protected]
Subject: Re: [Rdkit-discuss] MCS module - bonding and hybridization in
substructure search
Hi Janusz,
I'm not 100% sure what you're looking for, but I think it has something to do
with including information about bond conjugation in the MCS procedure.
To confirm, can you please give a couple of examples of what you would like to
have as output from the algorithm? Something like this with the input molecules
on the left and the desired result on the right would help:
['CNC=CC', 'C=CNC=CC'] -> 'CNC=CC'
(I realize that specific example is not what you're looking for, it's just
intended to be an example)
Once I've seen that I can try to figure out if it is currently doable and, if
not, if it's possible to modify the code to support it.
Best,
-greg
On Fri, Nov 13, 2015 at 9:17 PM, Janusz Petkowski
<[email protected]> wrote:
Dear RDKit Community,
I am looking for a way to use the MCS module in RDKit to compare the atoms and
bonding of two molecules while also taking the hybridization of each atom into
consideration.
A solution to a similar problem was suggested before (inspired by this
RDKit-discuss thread started by Liz Wylie:
http://www.mail-archive.com/[email protected]/msg03676.html
and see here http://sourceforge.net/p/rdkit/mailman/message/31830412/ ),
but even though it is computationally correct, it does not necessarily mirror
some nuances of chemistry, and one may want to modify it in certain specific
cases.
It works most of the time, for cases like those proposed in the solution to
the Liz Wylie case:
smis = ['CC(C)=C','CC(C)C']
or
smis2 = ['CC(C)=C','CC(C)=N']
If we check whether the 'CCC' substructure is present in the molecules from
those two data sets using Greg Landrum's solution, CCC will be found only in
'CC(C)C', taking into account the atoms, the bonding and the hybridization of
the atoms. It is all correct and cool!
But let's look at another example:
Let's look for the N\CC\N substructure in 'C\C=C\NCCN\C=C\C', or for the 'NCN'
substructure in NCN-C=C or 'C=CNCNC=C'. It will not be found there even though,
"structurally speaking", it is there.
The problem is as follows: an electronegative atom next to a C=C bond will
pull electron density from that bond, so the N-C bond in NCN-C=C will have a
‘bit of’ double bond character, even if technically it is a single bond. The
current solution to the Liz Wylie problem does not ignore that and
distinguishes between a regular N-C bond and an N-C bond next to a C=C bond
(as in NCN-C=C), and because of that it will not find NCN in this structure.
NCS in NCSC=C is matched because the S atom is more electropositive than N or O
and so its bond does not have that double-bond character. My question to the
RDKit community is: how can Greg Landrum's solution to the Liz Wylie case be
modified to successfully match the cases I mentioned above, while still
retaining the hybridization check? (We do want to have the hybridization match;
we just want the bonding to be more important.) The problem is that the atoms
that are not matched, like the N atoms above, have sp2 hybridization but
technically are bonded by single bonds on all sides.
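A hedged sketch of one direction this could take, reusing the isotope-label
trick from the earlier thread: instead of encoding hybridization, each atom is
labelled with its element and the highest bond order it takes part in, and the
MCS search is asked to compare those labels. This drops hybridization from the
comparison entirely, which is only one reading of "bonding matters more". It
assumes an RDKit build that ships rdFMCS; the helper names and the label
encoding are illustrative assumptions, and the output has not been checked
against the examples in this thread.

```python
# Sketch only: label atoms by (element, highest bond order) and let the MCS
# compare those labels. Helper names and the label encoding are assumptions.

def bonding_label(atomic_num, bond_orders):
    """Encode element and highest bond order (1, 1.5, 2, 3) as one integer."""
    highest = max(bond_orders, default=0)
    return atomic_num * 10 + int(2 * highest)

def labelled_copy(mol):
    """Return a copy of mol whose isotope fields hold the bonding label."""
    from rdkit import Chem  # imported lazily; only this part needs RDKit
    out = Chem.Mol(mol)
    for atom in out.GetAtoms():
        orders = [bond.GetBondTypeAsDouble() for bond in atom.GetBonds()]
        atom.SetIsotope(bonding_label(atom.GetAtomicNum(), orders))
    return out

def bonding_mcs(smiles_list):
    """Run FindMCS on label-carrying copies of the input molecules."""
    from rdkit import Chem
    from rdkit.Chem import rdFMCS
    mols = [labelled_copy(Chem.MolFromSmiles(smi)) for smi in smiles_list]
    return rdFMCS.FindMCS(
        mols,
        atomCompare=rdFMCS.AtomCompare.CompareIsotopes,  # compare labels only
        bondCompare=rdFMCS.BondCompare.CompareOrder,
        ringMatchesRingOnly=True,
        completeRingsOnly=True,
    )
```

A call such as bonding_mcs(["NCN", "C=CNCNC=C"]) returns the usual MCS result
object. Its SMARTS is expected to be written in terms of the isotope labels
rather than element symbols, so mapping it back to atoms is left to the caller.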
Thanks a lot for your help, time and consideration. This is my first post on
the RDKit forum, and I am new to RDKit and Python in general, so I apologize if
anything is not clear.
I would really appreciate your help!
Best regards,
Janusz Petkowski
------------------------------------------------------------------------------
_______________________________________________
Rdkit-discuss mailing list
[email protected]<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss