Re: [Rdkit-discuss] Generation of stereo-isomers

2015-09-24 Thread Peter Shenkin
Umm... would that be all stereoisomers or all realizable stereoisomers? For
example, consider two bridgeheads in a norbornane-type compound. In this
case, only a particular enantiomeric pair would be realizable, and not all
four diastereomers.
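
For what it's worth, the distinction can be sketched in code. This assumes an RDKit recent enough to ship the EnumerateStereoisomers module and its tryEmbedding option (both postdate this message). The fused bicyclic SMILES is chosen purely for illustration, and "embeds successfully" is only a rough proxy for "realizable".

```python
from rdkit import Chem
from rdkit.Chem.EnumerateStereoisomers import (
    EnumerateStereoisomers,
    StereoEnumerationOptions,
)

# Illustrative fused bicyclic molecule (an assumption, not a structure from
# this thread); its ring-fusion centers cannot be set independently.
mol = Chem.MolFromSmiles("BrC=CC1OC(C2)(F)C2(Cl)C1")

# Every combination of the unassigned stereo elements.
all_isomers = list(EnumerateStereoisomers(mol))

# Keep only the isomers for which a 3D embedding succeeds.
opts = StereoEnumerationOptions(tryEmbedding=True)
realizable = list(EnumerateStereoisomers(mol, options=opts))

print(len(all_isomers), "enumerated;", len(realizable), "could be embedded")
```

The second count should never exceed the first; how much smaller it is depends on the ring system.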
--
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] cis/trans directional bond and smiles strings in python

2015-10-14 Thread Peter Shenkin
FWIW, this makes sense to me. To the extent that RDKit can recognize an
invalid SMARTS or SMILES and throw an exception for it, the user is
protected from some classes of error.

On Wed, Oct 14, 2015 at 10:39 AM, Rocco Moretti 
wrote:
>
> Would raising an error (or warning) be appropriate here? The SMILES parser
> is getting a two line string, but is only using the first.
>
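
Until the parser itself complains, a caller can guard against this. A minimal sketch in plain Python; the helper name single_smiles is made up for illustration and is not an RDKit function.

```python
def single_smiles(text):
    """Return the one SMILES in `text`, or raise if there is more than one line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) != 1:
        raise ValueError("expected exactly one SMILES line, got %d" % len(lines))
    return lines[0]

print(single_smiles("C/C=C/C\n"))  # a trailing newline is harmless

try:
    single_smiles("C/C=C/C\nC/C=C\\C")  # two lines: refuse rather than guess
except ValueError as err:
    print("rejected:", err)
```

The returned string can then be handed to Chem.MolFromSmiles as usual.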


Re: [Rdkit-discuss] MCS module - bonding and hybridization in substructure search

2015-11-15 Thread Peter Shenkin
Say, Greg,

If you understand Janusz's request, could you perhaps explain it in other
words? I don't quite follow it, despite having read the two emails.

I'm getting the sense that he wants to make sure that SP2 nitrogens match
only SP2 nitrogens (for example). Is this right? I know OpenEye has an
extension to specify hybridization, but don't know whether RDKit has
implemented something like that; if not, a recursive SMARTS ought to be
able to do it.
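
A sketch of what such a query could look like, assuming an RDKit version whose SMARTS parser supports the "^" hybridization extension described in the RDKit Book (I have not checked which release introduced it). The molecule is one of the queries from the quoted message.

```python
from rdkit import Chem

# One of the "query" molecules from the quoted message.
mol = Chem.MolFromSmiles("CC(N)C(N)=O")

# "^2" and "^3" are RDKit SMARTS extensions for SP2 and SP3 hybridization.
sp2_n = Chem.MolFromSmarts("[N;^2]")
sp3_n = Chem.MolFromSmarts("[N;^3]")

print("SP2 nitrogens:", mol.GetSubstructMatches(sp2_n))  # amide N
print("SP3 nitrogens:", mol.GetSubstructMatches(sp3_n))  # amine N
```

This agrees with the tables in the quoted message, which list atom 2 as an SP3 nitrogen and atom 4 as an SP2 nitrogen.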

On Sun, Nov 15, 2015 at 10:55 AM, Janusz Petkowski  wrote:

> Dear Greg,
>
>
>
> Thank you very much for your reply. I will try to explain more what I
> would like to achieve, I hope that it will clarify things a little.
>
>
> Let's look at your example first and let's treat the first molecule
> (CC=CNC) in ["CC=CNC", "C=CNC=CC"] as a "query", we would like to check if
> it is an EXACT match to the second molecule ("C=CNC=CC").
>
>
> Your example is a case of the "solution to the Liz Wylie problem" at its
> best.
>
>
> ["CC=CNC", "C=CNC=CC"] ==> CC=CN - so 'no' - no exact match! And it is
> what we would expect upon the implementation of the current "solution to
> the Liz Wylie problem" and this is what I would consider "CORRECT" for my
> purposes.
>
> Tables below are as follows:
>
> >>> bond_type, bond_start_atom, bond_start_atom_symbol,
> bond_start_atom_hyb, bond_end_atom, bond_end_atom_symbol, bond_end_atom_hyb
>
>
>
> CC=CNC
>
> SINGLE 0 C SP3 1 C SP2
>
> DOUBLE 1 C SP2 2 C SP2
>
> SINGLE 2 C SP2 3 N SP2
>
> SINGLE 3 N SP2 4 C SP3
>
>
>
> C=CNC=CC
>
>
>
> DOUBLE 0 C SP2 1 C SP2
>
> SINGLE 1 C SP2 2 N SP2
>
> SINGLE 2 N SP2 3 C SP2
>
> DOUBLE 3 C SP2 4 C SP2
>
> SINGLE 4 C SP2 5 C SP3
>
>
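
For anyone wanting to reproduce tables in this layout, here is a rough sketch. The function name bond_table is made up for illustration, and the hybridizations printed are whatever the installed RDKit perceives.

```python
from rdkit import Chem

def bond_table(smiles):
    """One row per bond: type, then index, symbol and hybridization of each end."""
    mol = Chem.MolFromSmiles(smiles)
    rows = []
    for bond in mol.GetBonds():
        a, b = bond.GetBeginAtom(), bond.GetEndAtom()
        rows.append(" ".join([
            str(bond.GetBondType()),
            str(a.GetIdx()), a.GetSymbol(), str(a.GetHybridization()),
            str(b.GetIdx()), b.GetSymbol(), str(b.GetHybridization()),
        ]))
    return rows

for smi in ("CC=CNC", "C=CNC=CC"):
    print(smi)
    print("\n".join(bond_table(smi)))
```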
>
> In your example the hybridizations of C atoms in the CNC fragment of both
> molecules do not match and the overall result is ok. In the first "query"
> molecule the hybridization of the first C in the CNC fragment is sp2 (and
> it is connected to the first C in the "query" molecule via the DOUBLE
> bond), then the N is sp2, but the last C is sp3 and is bonded only via
> SINGLE bonds. In the second molecule (C=CNC=CC) both carbons in CNC
> fragment are sp2 AND both carbons are bonded via DOUBLE bonds, not like in
> the "query" molecule DOUBLE and SINGLE.
>
> What I would like to do is to check if one structure is an exact match
> within the other, so the atoms must match, the bonds must match and the
> hybridization of an atom must match, but the bonding is the most important
> thing and that is where the exceptions show, because you can have an sp2
> atom that is bonded via a SINGLE bond. Let me illustrate with a couple of
> examples what I mean.
>
>
> Examples to illustrate it:
>
>
> Example 1, Ala-Ala dipeptide case:
>
>
>
> CC(N)C(=O)NC(C)C(=O)O
>
>
>
> SINGLE 0 C SP3 1 C SP3
>
> SINGLE 1 C SP3 2 N SP3
>
> SINGLE 1 C SP3 3 C SP2
>
> DOUBLE 3 C SP2 4 O SP2
>
> SINGLE 3 C SP2 5 N SP2
>
> SINGLE 5 N SP2 6 C SP3
>
> SINGLE 6 C SP3 7 C SP3
>
> SINGLE 6 C SP3 8 C SP2
>
> SINGLE 8 C SP2 9 O SP2
>
> DOUBLE 8 C SP2 10 O SP2
>
>
>
> if I have two "query" molecules:
>
>
> 1) CC(N)C(N)=O
>
> CC(N)C(N)=O
>
> SINGLE 0 C SP3 1 C SP3
>
> SINGLE 1 C SP3 2 N SP3
>
> SINGLE 1 C SP3 3 C SP2
>
> SINGLE 3 C SP2 4 N SP2
>
> DOUBLE 3 C SP2 5 O SP2
>
>
>
> ["CC(N)C(N)=O", "CC(N)C(=O)NC(C)C(=O)O"] ==> CC(N)C(N)=O - so 'yes' - the
> exact match! And "CORRECT!"
>
> 2) CC(N)C(O)=O
>
> CC(N)C(=O)O
>
> SINGLE 0 C SP3 1 C SP3
>
> SINGLE 1 C SP3 2 N SP3
>
> SINGLE 1 C SP3 3 C SP2
>
> SINGLE 3 C SP2 4 O SP2
>
> DOUBLE 3 C SP2 5 O SP2
>
> ["CC(N)C(=O)O", "CC(N)C(=O)NC(C)C(=O)O"] ==> CC(N)C=O - so 'no' - no exact
> match! But it should be "CORRECT" because it is there.
>
>
> I would like to check whether the query molecules are an EXACT match in the
> Ala-Ala dipeptide case CC(N)C(=O)NC(C)C(=O)O. If we implement the
> current "solution to the Liz Wylie problem", only molecule 1) will be
> found there, and molecule 2) will not be found in CC(N)C(=O)NC(C)C(=O)O
> due to the non-matching hybridizations of the N atom. I very much need the
> "solution to the Liz Wylie problem" to prevent matching atoms with
> different hybridizations, but at the same time I would like to ensure that
> if an atom happens to have sp2 hybridization but is
> bonded by a single bond, then its hybridization state is less important and
> what really matters is its bonding.
>
>
>
> Example 2:
>
>
> C\C=C\NC1CCC1
>
> CC=CNC1CCC1
>
> SINGLE 0 C SP3 1 C SP2
>
> DOUBLE 1 C SP2 2 C SP2
>
> SINGLE 2 C SP2 3 N SP2
>
> SINGLE 3 N SP2 4 C SP3
>
> SINGLE 4 C SP3 5 C SP3
>
> SINGLE 5 C SP3 6 C SP3
>
> SINGLE 6 C SP3 7 C SP3
>
> SINGLE 7 C SP3 4 C SP3
>
>
>
> Two "query" molecules:
>
>
> 1) C\C=C\N
>
> CC=CN
>
> SINGLE 0 C SP3 1 C SP2
>
> DOUBLE 1 C SP2 2 C SP2
>
> SINGLE 2 C SP2 3 N SP2
>
>
> ["C\C=C\N", "C\C=C\NC1CCC1"] ==> C/C=C/N - so 'yes' - the exact match! And
> "CORRECT!"
>
>
> This is an easy example - everything matches between the "query" and the
> molecule - the atoms, the bonding and the hybridization.
>
>
> 2)

[Rdkit-discuss] The Chlorine molfile question

2016-01-20 Thread Peter Shenkin
It seems to me that what we are talking about now has (or should have!)
more to do with the interpretation of the terrible old PDB file format than
about any software convention.

It seems to me that software that must read this format should turn the
contents into something generally chemically acceptable (that is, "Cl", not
"CL", in this case) rather than foolishly propagating the error, or
accepting it in other contexts.
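
As a minimal sketch of the kind of clean-up I mean (plain Python; the helper name is made up, and it is only meant for the element field, not the atom-name field, where "CA" may well be an alpha carbon):

```python
def normalize_element(symbol):
    """Turn an upper-case element field such as 'CL' into 'Cl'."""
    return symbol.strip().capitalize()

for raw in ("CL", "BR", " N", "Cl"):
    print(repr(raw), "->", normalize_element(raw))
```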

As for those who would write that format, fight it! :-)

The above, in my view, represents the voice of reason, and is therefore
unlikely to be generally adopted.

-P.

On Wed, Jan 20, 2016 at 10:42 AM, John M 
wrote:

> Whoops, wrong thread; this was in regard to the Chlorine molfile question.
>
> Regards,
> John W May
> john.wilkinson...@gmail.com
>
> On 20 January 2016 at 15:40, John M  wrote:
>
>> The joys of the molfile - was curious whether it was accepted/correctly
>> interpreted:
>>
>> ISIS Draw 2.5 Yes (arguably the arbiter of the format)
>> ChemDraw 15 Yes
>> ChemDoodle No (accepted but only as a text label 'CL' no conversion)
>> MarvinSketch Yes
>> CDK Yes
>> OEChem Yes
>> Open Babel Yes
>> Indigo Yes
>>
>> J
>>
>> Regards,
>> John W May
>> john.wilkinson...@gmail.com
>>
>> On 20 January 2016 at 13:19, Greg Landrum  wrote:
>>
>>> Hi Joos,
>>>
>>> As long as you are sure to be consistent, it is certainly ok to generate
>>> fingerprints for molecules with Hs still attached, but it's very easy to
>>> make a mistake.
>>>
>>> The default behavior of the RDKit is to remove Hs. This is what I would
>>> recommend before doing things like generating fingerprints or descriptors.
>>>
>>>
>>> -greg
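
The effect is easy to check. A small sketch using the RDKit topological fingerprint with default settings; ethanol is just an illustrative input, and other fingerprint types may differ in how strongly they react to explicit Hs.

```python
from rdkit import Chem, DataStructs

mol = Chem.MolFromSmiles("CCO")   # Hs are implicit after SMILES parsing
mol_h = Chem.AddHs(mol)           # same molecule, Hs now explicit atoms
mol_back = Chem.RemoveHs(mol_h)   # Hs stripped again

fp = Chem.RDKFingerprint(mol)
fp_h = Chem.RDKFingerprint(mol_h)
fp_back = Chem.RDKFingerprint(mol_back)

# Explicit Hs change the fingerprint; removing them restores it.
sim_with_h = DataStructs.TanimotoSimilarity(fp, fp_h)
sim_back = DataStructs.TanimotoSimilarity(fp, fp_back)
print(sim_with_h, sim_back)
```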
>>>
>>>
>>> On Wed, Jan 20, 2016 at 7:06 AM, Joos Kiener 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I've been looking at different Fingerprints within the RDKit when I
>>>> realized that it matters for many of them whether Hydrogens are
>>>> explicitly present or not. This probably was obvious and clear for many of
>>>> you but I wasn't aware of that.
>>>>
>>>> To visualize what I mean please see below notebook:
>>>>
>>>>
>>>> http://nbviewer.jupyter.org/github/kienerj/notebooks/blob/master/Fingerprint%20Similarity%20-%20with%20and%20without%20hydrogens.ipynb
>>>>
>>>> Now my questions are:
>>>>
>>>> Should I always add hydrogens before generating fingerprints or should
>>>> I remove them?
>>>>
>>>> How is this handled in KNIME nodes? Do I need to perform the according
>>>> action (add/remove H) before generating the fingerprint? Or is this done
>>>> correctly already internally of the node?
>>>>
>>>> Thank you for your help.
>>>>
>>>> Best Regards,
>>>>
>>>> Joos



[Rdkit-discuss] SMILES: Why are rings consisting of wildcards assumed to be aromatic?

2015-06-11 Thread Peter Shenkin
If I canonicalize *1**1 in RDKit, I get  [*]1:[*]:[*]:1.

I expected [*]1[*][*]1.

I can think of no reason that the wildcard type in this context should be
assumed to be aromatic.

Indeed, ** is canonicalized as [*][*], demonstrating that RDKit does not in
general require wildcards to be aromatic. (Else I'd have expected some sort
of error message rejecting the input.)

Though some other SMILES implementations do, I believe, make the same
assumption, others do not, and again, I do not think it is justified.

Thanks,
-P.
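
For anyone who wants to check their own installation, a short sketch. I deliberately give no expected output for the ring case, since the behaviour may differ between RDKit versions.

```python
from rdkit import Chem

for smi in ("*1**1", "**"):
    mol = Chem.MolFromSmiles(smi)
    # Report the canonical SMILES and whether each dummy atom was flagged aromatic.
    print(smi, "->", Chem.MolToSmiles(mol))
    print("  aromatic flags:", [atom.GetIsAromatic() for atom in mol.GetAtoms()])
```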


Re: [Rdkit-discuss] SMILES: Why are rings consisting of wildcards assumed to be aromatic?

2015-06-15 Thread Peter Shenkin
On Mon, Jun 15, 2015 at 9:54 AM, Greg Landrum 
wrote:
>
> On Thu, Jun 11, 2015 at 5:54 PM, Peter Shenkin  wrote:
>
>> If I canonicalize *1**1 in RDKit, I get  [*]1:[*]:[*]:1.
>>
>> I expected [*]1[*][*]1.
>>
> ...
>
> This is certainly a bug and I've put it on the list of stuff to fix:
> https://github.com/rdkit/rdkit/issues/518
>
>
Thanks much, Greg. That is very helpful.

Pursuing another remark you made, RDKit canonicalizes C1=C*C=C1
as [*]11. This may also be unwarranted, because the wildcard could be
another C, in which case the structure would not be aromatic.

-P.


Re: [Rdkit-discuss] SMILES: Why are rings consisting of wildcards assumed to be aromatic?

2015-06-15 Thread Peter Shenkin
By the way, lest I appear ungrateful, I'd like to thank Greg/RDKit for
making the inter-ring bonds in biphenylene single, rather than
aromatic, in the unique SMILES. This is something that several other
kits of my acquaintance get wrong.

-P.



Re: [Rdkit-discuss] SMILES: Why are rings consisting of wildcards assumed to be aromatic?

2015-06-16 Thread Peter Shenkin
Actually, Greg, I think there is a way to win and I think you have in fact won.

I was asking myself what behavior one would actually require of a
canonical SMILES containing wildcards.

What I came up with is: if you replace the wildcards in the canonical
SMILES with any atoms that could result in a legal structure, you
should be able to recanonicalize the result to a legal SMILES.

For the current example, [*]11 where the wildcard is a C, C11
canonicalizes to CC1=CC=C1, which meets the criterion. So, to me, this
is in fact a win. If it had resulted in an error (because the starting
SMILES contains aromatic atoms, but cannot be aromatic), I'd have
regarded it as a loss.

(By the above criterion, a kit that did regard C11 as illegal
would have to canonicalize *11 as *C1=CC=C1 or something similar.)

Anyway, I apologize for getting rather arcane here. Separately, I
think I have found an example of two equivalent SMILES for a real
molecule (no wildcards) that canonicalize differently in RDKit. I'll
start a separate thread for this.

-P.


On Tue, Jun 16, 2015 at 12:36 AM, Greg Landrum  wrote:
> On Mon, Jun 15, 2015 at 6:11 PM, Peter Shenkin  wrote:
...
>> Pursuing another remark you made, RDKit canonicalizes C1=C*C=C1 as
>> [*]11. This may also be unwarranted, because the wildcard could be
>> another C, in which case the structure would not be aromatic.
>
>
> That's correct.[1] There's not really a right answer when treating molecules
> with query features as "real" molecules, this is just the convention that
> the RDKit takes when canonicalizing structures containing dummy atoms.
>
> -greg
> [1] Though, technically, if the * is a [CH-], then the ring would be
> aromatic again. There's no way to win here. :-)



[Rdkit-discuss] Two SMILES that (I think) should canonicalize to the same thing, but don't

2015-06-16 Thread Peter Shenkin
[N-]=[N+]=NC(=O)N1C(=O)N([N+]([O-])=O)C2(C13C4=C56)C4=C5C2=C36
[N-]=[N+]=NC(=O)N(C(=O)N1[N+]([O-])=O)C(c23)(c4c56)C16c3c5c24

rdkit canonicalizes the two to the following, respectively:

[N-]=[N+]=NC(=O)N1C(=O)N([N+](=O)[O-])C23c4c5c2c2c-5c4C213
[N-]=[N+]=NC(=O)N1C(=O)N([N+](=O)[O-])C23c4c5c6c(c2c4=6)C513

I believe these represent the same structure, with the following caveat:

It is not impossible that the two SMILES actually code for different
structures in some subtle way. I've tried visualizing them in several
packages, however, and I've not been able to find a difference. Some
packages canonicalize them to the same structure and others do not.
The actual structure is chiral, but I've been looking at this from the
point of view of SMILES without stereochemical information.

The two original SMILES come from a different package. That package
puts them out as SMILES which are dependent on the atom numbering in
the input structure file. The originating package does canonicalize
these to the same structure, however.

I don't think it is correct to consider the double-bonded atoms
aromatic, which the originating package does in one case. However,
FWIW, RDKit canonicalizes them as aromatic in both cases. But the main
issue is that RDKit canonicalizes them differently.

It's kind of a grotty molecule, so it's possible I'm missing
something. If so, I'd appreciate being set right.
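
The comparison itself is simple to script. A sketch, with a made-up helper name; the benzene pair is only a sanity check, and I make no claim about what a given RDKit version prints for the two SMILES above.

```python
from rdkit import Chem

def canonical(smiles):
    """Canonical SMILES, or None if RDKit cannot parse the input."""
    mol = Chem.MolFromSmiles(smiles)
    return None if mol is None else Chem.MolToSmiles(mol)

# Sanity check: Kekule and aromatic benzene should agree.
print(canonical("C1=CC=CC=C1") == canonical("c1ccccc1"))

# The two inputs from this message.
a = canonical("[N-]=[N+]=NC(=O)N1C(=O)N([N+]([O-])=O)C2(C13C4=C56)C4=C5C2=C36")
b = canonical("[N-]=[N+]=NC(=O)N(C(=O)N1[N+]([O-])=O)C(c23)(c4c56)C16c3c5c24")
print(a)
print(b)
print("same canonical SMILES:", a == b)
```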



Re: [Rdkit-discuss] Two SMILES that (I think) should canonicalize to the same thing, but don't

2015-06-16 Thread Peter Shenkin
Thanks, Andrew...

"BTW, to help it out, you can ask RDKit to include all of the bond information,
as otherwise it will use the "single-or-aromatic" notation."

That's a nice feature.
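
For the record, I take this to mean the allBondsExplicit argument of MolToSmiles. A sketch, with biphenyl as an illustrative input:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("c1ccccc1-c1ccccc1")  # biphenyl, illustrative only

# Default output leaves bonds implicit wherever SMILES allows it.
default_smiles = Chem.MolToSmiles(mol)
# With allBondsExplicit every bond symbol is written out.
explicit_smiles = Chem.MolToSmiles(mol, allBondsExplicit=True)

print(default_smiles)
print(explicit_smiles)
```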

"I don't know how it is that RDKit adds a double bond to the second cubane,
given only aromatic carbons and single-or-aromatic bonds in the original
SMILES."

Well, I was first wondering how the aromatic atoms got in there at
all, but then I saw that if you just isolate the double-bonded
carbons, it's like Dewar benzene with the bridgehead H's removed.

I wonder if this C6H4 beast has a name (whether or not it has an
existence ;-) ).



Re: [Rdkit-discuss] Two SMILES that (I think) should canonicalize to the same thing, but don't

2015-06-16 Thread Peter Shenkin
Hi, Greg,

Within the SMILES framework, it seems to me that if you allow the atoms to
be aromatic, then these are two Kekule structures of the same aromatic
system, and however you do the canonicalization, they ought to canonicalize
to the same structure, which the two examples did not do. I don't think you
addressed this.

I think now that there is no issue with having a double bond between two
aromatic atoms beyond our preconceptions. If that is a problem, you could
Kekulize it per your first picture, (though perhaps that is inconvenient in
the context of the implementation).

I actually didn't realize why aromaticity (particularly the double bond)
made sense when I originally wrote, so the above is with the benefit of
hindsight, and your comments.

I think the molecule is entertaining in several ways. In the cubane
geometry, the molecule cannot be conventionally aromatic. Might it actually
be antiaromatic? Could there be two forms?

Dunno
-P.


On Wed, Jun 17, 2015 at 1:25 AM, Greg Landrum 
wrote:
>
>
> The problematic part of your two molecules can be reduced to:
> [image: Inline image 3]
> and
> [image: Inline image 4]
> That second one shows the kekulized form that the RDKit ends up using.
>
> These produce the following canonical SMILES:
>
> In [31]: Chem.CanonSmiles('C1=CC2=CC=C12')
> Out[31]: 'c1cc2ccc1-2'
>
> In [32]: Chem.CanonSmiles('C1=CC2=C1C=C2')
> Out[32]: 'c1cc2ccc1=2'
>
>


Re: [Rdkit-discuss] Two SMILES that (I think) should canonicalize to the same thing, but don't

2015-06-17 Thread Peter Shenkin
"We could consider some quantum-mechanical calculations "

Yes! For the question of "the true nature" of the molecule. But that
need not affect the way canonicalization is done.

These are two different forms of entertainment.

-P.


On Wed, Jun 17, 2015 at 3:24 AM, Markus Sitzmann 
wrote:

> We could consider some quantum-mechanical calculations ... well, I always
> hated this discussion when I heard for my web service with millions of
> structures, I should consider quantum-mechanical calculations as part of
> the structure normalization/canonicalization ;-)


Re: [Rdkit-discuss] Two SMILES that (I think) should canonicalize to the same thing, but don't

2015-06-17 Thread Peter Shenkin
Hi,

I do not insist on using kekule forms. In fact, I said that using a
double bond between two aromatic atoms in a SMILES does not appear
problematic to me.

I was trying to say in the line you quoted that even if analysis of QM
results leads to a verdict of non-aromaticity, such a verdict should
not prevent us from creating canonical ("unique") SMILES using
aromatic atoms and bonds. The two actually have little to do with each
other.

( Start parenthetical remark:
Having said that, however, there are some situations where a SMILES is
traditionally created using aromatic types where that is unnecessary;
think furan and pyrrole. Aromatic types are unnecessary, because there
are no reasonable alternative kekule forms.

But even so, I am not at all arguing for elimination of aromatic types
from SMILES whenever feasible. It's fine with me if packages use
aromatic types for pyrrole and furan, and they for the most part do.
End parenthetical remark)
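
To see both spellings side by side, here is a sketch; the helper name kekule_smiles is made up, and the exact atom ordering of the output is left to whatever the installed RDKit produces.

```python
from rdkit import Chem

def kekule_smiles(smiles):
    """Canonical SMILES written with explicit double bonds instead of aromatic types."""
    mol = Chem.MolFromSmiles(smiles)
    # Kekulize in place and drop the aromatic flags before writing.
    Chem.Kekulize(mol, clearAromaticFlags=True)
    return Chem.MolToSmiles(mol, kekuleSmiles=True)

for name, smi in (("furan", "c1ccoc1"), ("pyrrole", "c1cc[nH]c1")):
    aromatic = Chem.MolToSmiles(Chem.MolFromSmiles(smi))
    print(name, aromatic, kekule_smiles(smi))
```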

I've encountered a few situations where I would take issue with some
packages' use (or non-use) of aromatic types, and maybe (since we're
having fun with this topic) I'll post some of these at some point in a
different thread. But I don't feel this way about RDKit's
canonicalization of any of the systems we've been discussing in this
thread.

My point in this thread is the one stated in the Subject: line: there
are sometimes two equivalent SMILES that are canonicalized
differently. I'm happy to find that the prevailing view is in
agreement with my opinion that these specific cases are bugs. (Happy
only because that means they'll likely be fixed at some point!)

-P.





On Wed, Jun 17, 2015 at 1:34 PM, Dimitri Maziuk  wrote:
> On 06/17/2015 08:36 AM, Peter Shenkin wrote:
>> "We could consider some quantum-mechanical calculations "
>>
>> Yes! For the question of "the true nature" of the molecule. But that
>> need not affect the way canonicalization is done.
>
> Again, define "canonical". If you insist on using kekule form in a
> binary computer, you'll have to have 2 distinctly different canonical
> benzenes. That's just how a binary computer works.
>
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
>
>



Re: [Rdkit-discuss] Stereochemistry - Differences between RDKit & Indigo

2015-08-19 Thread Peter Shenkin
Maybe when you have a toolkit as blazingly fast as RDKit, it captures the
chirality of the N center before it has time to interconvert.

-P.

On Wed, Aug 19, 2015 at 10:17 PM, John M 
wrote:

> More odd is the carbon stereocentre with two methyls...
>
> Generally trivalent nitrogens are not considered chiral due to inversion
> of the lone-pair. The two usual exceptions are when they are a bridgehead
> or in a tight ring (e.g. aziridine). This is the same in most toolkits; the
> InChI technical documentation provides useful examples.
>
> InChI actually only sees one stereo centre since it strips the proton off:
>
> InChI=1S/C13H26N2/c1-4-14-8-5-12(6-9-14)15-10-7-13(15)11(2)3/h11-13H,4-10H2,1-3H3/p+1/t13-/m1/s1
>
> It may well be chiral in this case but since it's not you should also
> strictly remove the other stereocentre in the para position to the nitrogen
>
> For the record just tested and ChemAxon/CDK/OpenBabel do the same.
>
> John
>
> Regards,
> John W May
> john.wilkinson...@gmail.com
>
> On 19 August 2015 at 09:00, Rob Smith  wrote:
>
>> Dear RDKit community,
>>
>> I'm trying to use RDKit to read in Corina generated stereoisomers (from a
>> Mol file), assign chiral tags and stereochemistry to the structure and
>> output the canonical smiles string for each isomer of a given molecule (in
>> Python), when I do this, half the canonical smiles strings are not unique.
>>
>> When I read in the output from Corina into an Indigo instance, then use
>> the canonical smiles from Indigo to create an RDKit molecule, canonical
>> smiles strings generated from the molecule objects are all unique.
>>
>> I may be missing an option to enable RDKit to 'visualise' the chiral
>> centre adjacent to the protonated nitrogen, so if someone can spot where
>> I've made a mistake, I'd really appreciate it. I've included the output and
>> Python script below. If you require any further information, please let me
>> know.
>>
>> Many thanks,
>> Rob
>>
>> Output:
>>
>> RDKit Read in of Molecule
>> RDKit Output -  CCN1CC[C@@H]([N@@H+]2CC[C@@H]2[C@H](C)C)CC1
>> RDKit Output -  CCN1CC[C@@H]([N@@H+]2CC[C@@H]2[C@H](C)C)CC1
>> RDKit Output -  CCN1CC[C@@H]([N@H+]2CC[C@@H]2[C@H](C)C)CC1
>> RDKit Output -  CCN1CC[C@@H]([N@H+]2CC[C@@H]2[C@H](C)C)CC1
>> RDKit Output -  CCN1CC[C@@H]([N@@H+]2CC[C@H]2[C@H](C)C)CC1
>> RDKit Output -  CCN1CC[C@@H]([N@@H+]2CC[C@H]2[C@H](C)C)CC1
>> RDKit Output -  CCN1CC[C@@H]([N@H+]2CC[C@H]2[C@H](C)C)CC1
>> RDKit Output -  CCN1CC[C@@H]([N@H+]2CC[C@H]2[C@H](C)C)CC1
>>
>> INDIGO Read in of Molecule
>> RDKit Output -  CC[N@]1CC[C@@H]([N@@H+]2CC[C@@H]2C(C)C)CC1
>> RDKit Output -  CC[N@]1CC[C@H]([N@@H+]2CC[C@@H]2C(C)C)CC1
>> RDKit Output -  CC[N@]1CC[C@@H]([N@H+]2CC[C@@H]2C(C)C)CC1
>> RDKit Output -  CC[N@]1CC[C@H]([N@H+]2CC[C@@H]2C(C)C)CC1
>> RDKit Output -  CC[N@]1CC[C@@H]([N@@H+]2CC[C@H]2C(C)C)CC1
>> RDKit Output -  CC[N@]1CC[C@H]([N@@H+]2CC[C@H]2C(C)C)CC1
>> RDKit Output -  CC[N@]1CC[C@@H]([N@H+]2CC[C@H]2C(C)C)CC1
>> RDKit Output -  CC[N@]1CC[C@H]([N@H+]2CC[C@H]2C(C)C)CC1
>>
>> Python script :
>>
>> from rdkit import Chem
>> import subprocess  # Used to run Corina
>> from indigo import *
>>
>> def runCorinaTest(inputMol):
>>     indigo = Indigo()
>>
>>     molFile = Chem.MolToMolBlock(inputMol)
>>
>>     corinaCommand = "echo \'" + molFile + "\' | "
>>     # Then Corina - generate stereoisomers...
>>     corinaCommand = corinaCommand + "/apps/corina/corina -t n -d canon,stergen,preserve,names,wh,flapn,msc=7,msi=128 -i t=sdf"
>>     # Gives the stereoisomer species as an SDF string
>>     corinaResult = subprocess.check_output([corinaCommand], shell=True)
>>
>>     allMoleculeObjects = []
>>     # Separate Corina output into individual molecules; SDF records end
>>     # with a "$$$$" line, so split on that rather than on every newline
>>     allMolecules = corinaResult.split("$$$$\n")
>>     allMolecules = allMolecules[0:len(allMolecules)-1]
>>
>>     print("RDKit Read in of Molecule")
>>
>>     for eachMolecule in allMolecules:
>>         eachMolecule = eachMolecule + "\n"
>>         mol = Chem.MolFromMolBlock(eachMolecule, sanitize=True,
>>                                    removeHs=True, strictParsing=False)
>>         Chem.rdmolops.AssignAtomChiralTagsFromStructure(mol, replaceExistingTags=True)
>>         Chem.rdmolops.AssignStereochemistry(mol)
>>         print("RDKit Output -  " + Chem.MolToSmiles(mol, isomericSmiles=True))
>>
>>     print("INDIGO Read in of Molecule")
>>     for eachMolecule in allMolecules:
>>         eachMolecule = eachMolecule + "\n"
>>         mol = indigo.loadMolecule(eachMolecule)
>>         # print("Indigo Output - " + mol.canonicalSmiles())
>>         # Use Indigo Canonical Smiles to create RDKit molecule
>>         mol = Chem.MolFromSmiles(mol.canonicalSmiles())
>>         if mol is not None:
>>             print("RDKit Output -  " + Chem.MolToSmiles(mol, isomericSmiles=True))
>>
>>     return 0
>>
>> mol = Chem.MolFromSmiles("CC(C)C1[NH+](C2CCN(CC)CC2)CC1")
>> z = runCorinaTest(mol)
>>
>>

Re: [Rdkit-discuss] Stereochemistry - Differences between RDKit & Indigo

2015-08-20 Thread Peter Shenkin
"My initial answer, and I would love input on this, is that
three-coordinate N should always have stereochemistry removed."

Umm... even if it's a bridgehead?

-P.

On Thu, Aug 20, 2015 at 10:30 AM, Greg Landrum 
wrote:

> This isn't a simple one, so it may take a bit to get to an answer that's
> comprehensible.
>
> There are two things going on here in the RDKit:
> 1) Ring stereochemistry
> 2) stereochemistry about nitrogen centers
>
> Let's start with the second, because it's easier: RDKit does not generally
> "believe in" stereochemistry around three coordinate nitrogens. Here's a
> very simple example:
> In [45]: m3 = Chem.MolFromSmiles('Br[N@](F)Cl')
>
> In [46]: Chem.MolToSmiles(m3,isomericSmiles=True)
> Out[46]: 'FN(Cl)Br'
>
>
> The 3D equivalent of that:
> In [41]: m = Chem.MolFromSmiles('BrN(F)Cl')
>
> In [42]: AllChem.EmbedMolecule(m)
> Out[42]: 0
>
> In [43]: Chem.AssignAtomChiralTagsFromStructure(m)
>
> In [44]: Chem.MolToSmiles(m,isomericSmiles=True)
> Out[44]: 'FN(Cl)Br'
>
> Contrast this with what you get for a carbon:
>
> In [34]: m2 = Chem.MolFromSmiles('FC(Br)(Cl)I')
>
> In [35]: AllChem.EmbedMolecule(m2)
> Out[35]: 0
>
> In [36]: Chem.AssignAtomChiralTagsFromStructure(m2)
>
> In [37]: Chem.MolToSmiles(m2,isomericSmiles=True)
> Out[37]: 'F[C@](Cl)(Br)I'
>
>
> Back to the first: ring stereochemistry. By this I mean things like
> C[C@H]1CC[C@@H](C)CC1 - molecules where the stereochemistry information is
> really about whether the substituents of the ring are cis or trans relative
> to the ring plane.
>
> The way the RDKit handles this is something of a hack: it doesn't identify
> those atoms as chiral centers, but it does preserve the chiral tags when
> generating a canonical SMILES:
>
> In [47]: m = Chem.MolFromSmiles('C[C@H]1CC[C@@H](C)CC1')
>
> In [48]: Chem.FindMolChiralCenters(m)
> Out[48]: []
>
> In [49]: Chem.MolToSmiles(m,isomericSmiles=True)
> Out[49]: 'C[C@H]1CC[C@@H](C)CC1'
>
> Curiously, to me at least, it does the same thing with nitrogens;
>
> In [52]: m2 = Chem.MolFromSmiles('C[N@@]1CC[C@@H](C)CC1')
>
> In [53]: Chem.MolToSmiles(m2,isomericSmiles=True)
> Out[53]: 'C[C@H]1CC[N@](C)CC1'
>
> Lest anyone think that this might make sense because being a ring makes
> inversion more difficult, that's not what is going on here. If I make the
> ring truly chiral, then the stereochemistry of the N is removed:
>
> In [54]: m3 = Chem.MolFromSmiles('C[N@@]1CO[C@@H](C)CC1')
>
> In [55]: Chem.MolToSmiles(m3,isomericSmiles=True)
> Out[55]: 'C[C@H]1CCN(C)CO1'
>
> I believe that this inconsistent behavior is a bug: either N should always
> have the input stereochemistry preserved (and that should be perceived from
> the 3D coordinates) or it should never have the input stereochemistry
> preserved. My initial answer, and I would love input on this, is that
> three-coordinate N should always have stereochemistry removed.
>
> -greg
>
>
>
> On Thu, Aug 20, 2015 at 2:22 PM, Rob Smith  wrote:
>
>> Hi Greg,
>>
>> I've attached the SDF that Corina generates. I'm not convinced it is a
>> problem, more an observation that I'm trying to understand.
>>
>> Looking at the results again today - it seems that from the Corina output
>> Indigo is interpreting the conformer (including whether the ethyl
>> substituent on the piperidine nitrogen is equatorial or axial) - and
>> outputting a canonical smiles string that has the conformer "encoded" in it
>> (using the chiral flags). Whereas RDKit is reading in the Corina output,
>> "discounting" whether the nitrogen is axial or equatorial (which due to
>> inversion I can understand) and interpreting it as having only two chiral
>> centers (which is correct).
>>
>> What is confusing me, is that when I supply RDKit with the canonical
>> smiles string from Indigo (which has the conformer "encoded" in it), and
>> then ask for the isomeric canonical smiles, it supplies the canonical
>> smiles with the conformer still "encoded" within it.
>>
>> For example, I read in the following canonical smiles string into
>> RDKit: CCN1CC[C@@H]([N@@H+]2CC[C@@H]2[C@H](C)C)CC1 (which was generated
>> by reading in one of the mols in the SD File into RDKit and output the
>> isomeric canonical smiles), running the FindMolChiralCenters on this
>> molecule, correctly reports the number of chiral centres to be 2 (6S, 9R),
>> and then asking it to output the canonical smiles string (with
>> isomericSmiles=True) gives CCN1CCC([N@@H+]2CC[C@@H]2C(C)C)CC1 (1).
>>
>> If I take the same mol file, read it into Indigo, and ask it to output
>> the canonical smiles string, I get: 
>> CC(C)[C@H]1CC[N@H+]1[C@@H]1CC[N@@](CC1)CC,
>> if I read this smiles string into RDKit and run FindMolChiralCenters on it, I get
>> (3R, 6S) - which is fine, if I then output the canonical smiles (again with
>> isomericSmiles=True) I get CC[N@]1CC[C@@H]([N@@H+]2CC[C@@H]2C(C)C)CC1. I
>> expected this isomeric canonical smiles to be the same as (1), however
>> RDKit appears to conserve the conformer representation given to it from an