Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently
On Thu, Nov 9, 2017 at 6:32 AM, Brian Cole wrote:

> Hi Cheminformaticians,
>
> This is an extreme subtlety in the interpretation of SMILES atom
> stereochemistry and I think a bug in RDKit. Specifically, I think the
> following SMILES should be the same molecule:
>
> >>> rdkit.__version__
> '2017.09.1'
> >>> Chem.CanonSmiles('F[C@@]1(C)CCO1')
> 'C[C@]1(F)CCO1'
> >>> Chem.CanonSmiles('[C@@](F)1(C)CCO1')
> 'C[C@@]1(F)CCO1'

As was discussed in the comments of https://github.com/rdkit/rdkit/issues/786, I think it's pretty gross that the second syntax is even legal. But that's a side point.

> Since there is no hydrogen inside the stereo carbon atom block, the bond
> being 'looked down' should be the one to the first atom encountered. In
> both cases above, that should be the fluorine, therefore the molecules
> should be equivalent.

Agreed, and this is a view that's further supported by this behavior:

In [2]: Chem.CanonSmiles('F[C@@]1(C)CCO1')
Out[2]: 'C[C@]1(F)CCO1'

In [3]: Chem.CanonSmiles('F[C@@](C)1CCO1')
Out[3]: 'C[C@@]1(F)CCO1'

Would you mind filing a bug for this and I'll try to track it down/fix it?

Thanks,
-greg

> Though it could be argued the 2nd one is not strict SMILES as Andrew
> describes here: https://github.com/rdkit/rdkit/issues/786
>
> It is useful when recombining fragments with ring closure digits for these
> to be equivalent:
> [*][C@]1(C)CCO1
> [C@]([*])1(C)CCO1
>
> Also, every other tool I can get my hands on agrees they're the same:
> OEChem, OpenBabel, indigo, and ChemAxon. (CDK lacks a simple enough
> canonicalization example for me to work from.)
>
> Sure wish there was a SMILES validation test suite we could all run
> against. And so I'm attaching the examples I used to verify the above so
> whatever poor soul is assigned that task later can find this on Google.
> (I'm hopeful :-)
>
> Thanks,
> Brian
>
> PS: the current output from the script:
>
> $ python stereo_handling_first_atom.py
> RDKit = 2017.09.1
> OEChem = 2.1.2
> OpenBabel = 2.4.1
> indigo = 1.2.3.r0-g98188eb mac10.7
> RDKit failed to recognize these as the same:
> [*:1][C@]1([*:2])CC1(Cl)Cl -> ClC1(Cl)C[C@]1([*:1])[*:2]
> [C@]([*:1])1([*:2])CC1(Cl)Cl -> ClC1(Cl)C[C@@]1([*:1])[*:2]
> OpenBabel failed to recognize these as the same:
> Cl[S@](C)=O -> C[S@](=O)Cl
> [S@](Cl)(C)=O -> C[S@@](=O)Cl
> Indigo failed to recognize these as the same:
> Cl[S@](C)=O -> C[S@](=O)Cl
> [S@](Cl)(C)=O -> C[S@@](=O)Cl
> OpenBabel failed to recognize these as the same:
> Cl[S@](C)= -> =[S@](Cl)C
> [S@](Cl)(C)= -> =[S@@](Cl)C
> Indigo failed to recognize these as the same:
> Cl[S@](C)= -> =[S@@](C)Cl
> [S@](Cl)(C)= -> =[S@](C)Cl
> RDKit failed to recognize these as the same:
> Cl[C@](F)1CC[C@H](F)CC1 -> F[C@H]1CC[C@](F)(Cl)CC1
> [C@](Cl)(F)1CC[C@H](F)CC1 -> F[C@H]1CC[C@@](F)(Cl)CC1
> RDKit failed to recognize these as the same:
> Cl[C@]1(c2c2)NCCCS1 -> Cl[C@]1(c2c2)NCCCS1
> [C@](Cl)1(c2c2)NCCCS1 -> Cl[C@@]1(c2c2)NCCCS1
> RDKit failed to recognize these as the same:
> Cl3.[C@]31(c2c2)NCCCS1 -> Cl[C@]1(c2c2)NCCCS1
> [C@](Cl)1(c2c2)NCCCS1 -> Cl[C@@]1(c2c2)NCCCS1
> RDKit failed to recognize these as the same:
> Cl[C@](F)1C2C(C1)CNC2 -> F[C@@]1(Cl)CC2CNCC21
> [C@](Cl)(F)1C2C(C1)CNC2 -> F[C@]1(Cl)CC2CNCC21
> RDKit failed to recognize these as the same:
> [*][C@@H]1CO1 -> [*][C@@H]1CO1
> [C@H]([*])1CO1 -> [*][C@H]1CO1
> RDKit failed to recognize these as the same:
> [*][C@@]1(C)CCO1 -> [*][C@@]1(C)CCO1
> [C@@]([*])1(C)CCO1 -> [*][C@]1(C)CCO1
> RDKit failed to recognize these as the same:
> F[C@@]1(C)CCO1 -> C[C@]1(F)CCO1
> [C@@](F)1(C)CCO1 -> C[C@@]1(F)CCO1
> RDKit failed to recognize these as the same:
> Cl[C@@H]1[C@@H](Cl)C(Cl)CCN1 -> ClC1CCN[C@H](Cl)[C@H]1Cl
> [C@H](Cl)1[C@@H](Cl)C(Cl)CCN1 -> ClC1CCN[C@@H](Cl)[C@H]1Cl

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
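The argument in this message rests on the SMILES convention that @ and @@ are read relative to the order in which a stereocentre's neighbours appear, so exchanging any two neighbours flips the tag. The bookkeeping can be sketched in plain Python without RDKit. This is only an illustration: the function names and the neighbour labels ('F', 'ring1', 'CH3', 'CH2') are made up here, and the neighbour orders are written out by hand following Brian's reading of the two input strings. It is not RDKit's parser.

```python
def permutation_parity(reference, other):
    """Return 0 if `other` is an even permutation of `reference`, else 1."""
    position = {label: i for i, label in enumerate(reference)}
    perm = [position[label] for label in other]
    # Count inversions; their number modulo 2 is the parity.
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2


def expected_tag(tag, written_order, new_order):
    """Tag ('@' or '@@') needed in new_order to describe the same centre."""
    if permutation_parity(written_order, new_order) == 0:
        return tag
    return "@" if tag == "@@" else "@@"


# Neighbour order around the stereocentre, read left to right.
# 'ring1' stands for the atom reached through ring-closure digit 1.
first = ["F", "ring1", "CH3", "CH2"]      # F[C@@]1(C)CCO1
second = ["F", "ring1", "CH3", "CH2"]     # [C@@](F)1(C)CCO1, Brian's reading
canonical = ["CH3", "ring1", "F", "CH2"]  # C[C@?]1(F)CCO1

print(expected_tag("@@", first, canonical))   # @
print(expected_tag("@@", second, canonical))  # @
```

Under that reading both inputs lead to 'C[C@]1(F)CCO1', which is what RDKit printed for the first input but not for the second.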
Re: [Rdkit-discuss] SMARTS for Joback and Reid method
Chenyang,

I haven't looked at your SMARTS strings yet, but I do have this list of SMARTS strings for the Joback method that I compiled myself (for use here: https://www.wolframalpha.com/input/?i=2,3-methano-5,6-dichloroindene=3 ). Perhaps this can be of use. If you spot any mistakes, please let me know.

Jason

$JobackSubstructures={
{"Methyl", "-CH3", "[CX4H3]"},
{"SecondaryAcyclic", "-CH2-", "[!R;CX4H2]"},
{"TertiaryAcyclic", ">CH-", "[!R;CX4H]"},
{"QuaternaryAcyclic", ">C<", "[!R;CX4H0]"},
{"PrimaryAlkene", "=CH2", "[CX3H2]"},
{"SecondaryAlkeneAcyclic", "=CH-", "[!R;CX3H1;!$([CX3H1](=O))]"},
{"TertiaryAlkeneAcyclic", "=C<", "[$([!R;#6X3H0]);!$([!R;#6X3H0]=[#8])]"},
{"CumulativeAlkene", "=C=", "[$([CX2H0](=*)=*)]"},
{"TerminalAlkyne", "\[Congruent]CH", "[$([CX2H1]#[!#7])]"},
{"InternalAlkyne", "\[Congruent]C-", "[$([CX2H0]#[!#7])]"},
{"SecondaryCyclic", "-CH2- (ring)", "[R;CX4H2]"},
{"TertiaryCyclic", ">CH- (ring)", "[R;CX4H]"},
{"QuaternaryCyclic", ">C< (ring)", "[R;CX4H0]"},
{"SecondaryAlkeneCyclic", "=CH- (ring)", "[R;CX3H1,cX3H1]"},
{"TertiaryAlkeneCyclic", "=C< (ring)", "[$([R;#6X3H0]);!$([R;#6X3H0]=[#8])]"},
{"Fluoro", "-F", "[F]"},
{"Chloro", "-Cl", "[Cl]"},
{"Bromo", "-Br", "[Br]"},
{"Iodo", "-I", "[I]"},
{"Alcohol", "-OH", "[OX2H;!$([OX2H]-[#6]=[O]);!$([OX2H]-a)]"}, (* alcohol - not matching a carboxylic acid *)
{"Phenol", "-OH", "[$([OX2H]-a)]"},
{"EtherAcyclic", "-O-", "[OX2H0;!R;!$([OX2H0]-[#6]=[#8])]"},
{"EtherCyclic", "-O- (ring)", "[#8X2H0;R;!$([#8X2H0]~[#6]=[#8])]"},
{"CarbonylAcyclic", ">C=O", "[$([CX3H0](=[OX1]));!$([CX3](=[OX1])-[OX2]);!R]=O"},
{"CarbonylCyclic", ">C=O (ring)", "[$([#6X3H0](=[OX1]));!$([#6X3](=[#8X1])~[#8X2]);R]=O"},
{"Aldehyde", "O=CH-", "[CX3H1](=O)"},
{"CarboxylicAcid", "COOH", "[OX2H]-[C]=O"},
{"Ester", "-C(=O)O-", "[#6X3H0;!$([#6X3H0](~O)(~O)(~O))](=[#8X1])[#8X2H0]"},
{"OxygenDoubleBondOther", "=O", "[OX1H0;!$([OX1H0]~[#6X3]);!$([OX1H0]~[#7X3]~[#8])]"},
{"PrimaryAmino", "NH2", "[NX3H2]"},
{"SecondaryAminoAcyclic", ">NH", "[NX3H1;!R]"},
{"SecondaryAminoCyclic", ">NH (ring)", "[#7X3H1;R]"},
{"TertiaryAmino", ">N-", "[#7X3H0;!$([#7](~O)~O)]"}, (* Tertiary amine except nitro group *)
{"ImineCyclic", "=N- (ring)", "[#7X2H0;R]"},
{"ImineAcyclic", "=N-", "[#7X2H0;!R]"},
{"Aldimine", "=NH", "[#7X2H1]"},
{"Cyano", "-C\[Congruent]N", "[#6X2]#[#7X1H0]"},
{"Nitro", "NO2", "[$([#7X3,#7X3+][!#8])](=[O])~[O-]"},
{"Thiol", "-SH", "[SX2H]"},
{"ThioetherAcyclic", "-S-", "[#16X2H0;!R]"},
{"ThioetherCyclic", "-S- (ring)", "[#16X2H0;R]"}
};

Jason Biggs

On Wed, Nov 8, 2017 at 4:52 PM, Chenyang Shi wrote:

> Hi everyone,
>
> I have been recently working on a project that implements Joback method
> using RDKit (https://en.wikipedia.org/wiki/Joback_method).
>
> I believe the core to the success of this project is to make the 41
> functional groups correctly represented by SMARTS code. I have compiled my
> own codes, see attachment. I would appreciate your review of it and let me
> know if you spot errors.
>
> I think building a robust/well-tested SMARTS database (though small in my
> case) would be helpful to others and other projects.
>
> Thank you,
> Chenyang
>
> PS: The ones highlighted red in the document are robust.
Re: [Rdkit-discuss] Rdkit-discuss Digest, Vol 121, Issue 15
The Daylight website is a very good resource for SMILES, SMARTS, and SMIRKS.

http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html

JW
___
JW Feng, Ph.D.
Denali Therapeutics Inc.
151 Oyster Point Blvd, 2nd Floor, South San Francisco, CA 94080 | (650) 270-0628
Re: [Rdkit-discuss] SMARTS for =C=, #CH, #C-
Dear Andy,

Thank you for a quick and thorough email. I find it very instructional, although I need to read it a couple times more to digest it.

Cheers,
Chenyang
Re: [Rdkit-discuss] SMARTS for =C=, #CH, #C-
On Nov 8, 2017, at 21:00, Chenyang Shi wrote:
> =C= : [CH0;A;X2;!R](=[$(*)])=[$(*)]

The recursive SMARTS notation, which is the term inside of the [$(...)], finds a match for the entire pattern and returns the first atom in that pattern.

> For example, if I search "C=C=O" using "[CH0;A;X2;!R](=[$(*)])=[$(*)]",
> >>> from rdkit import Chem
> >>> m = Chem.MolFromSmiles('C=C=O')
> >>> m.GetSubstructMatches(Chem.MolFromSmarts("[CH0;A;X2;!R](=[$(*)])=[$(*)]"))
> ((1, 0, 2),)
>
> it prints out atomic positions 1, 0, 2--three positions. But I would expect
> only one position for the Carbon in the middle.

The $(*) finds the pattern, which is a "*" and in this case the terminal carbons, and returns it. The substructure search returns 3 positions because the first is [CH0;A;X2;!R], the second is the first atom of "*", and the third is the first atom of the other "*".

If you only want the first atom of the entire pattern, then put the entire pattern in a recursive SMARTS, as in:

  [$([CH0;A;X2;!R](=*)=*)]

>>> pat = Chem.MolFromSmarts("[$([CH0;A;X2;!R](=*)=*)]")
>>> mol = Chem.MolFromSmiles('C=C=O')
>>> mol.GetSubstructMatches(pat)
((1,),)

> Similarly, if I search "C#C" using "[CH1;A;X2;!R]#[$(*)]",
> >>> from rdkit import Chem
> >>> m = Chem.MolFromSmiles('C#C')
> >>> m.GetSubstructMatches(Chem.MolFromSmarts("[CH1;A;X2;!R]#[$(*)]"))
> ((0, 1),)
> I would expect two separate positions such as (0,), (1,), indicating there
> are two carbon triple bonds (with a hydrogen).

Since you are only looking for a single atom, try putting the entire pattern in a recursive SMARTS, as in

  [$([CH1;A;X2;!R]#*)]

>>> mol = Chem.MolFromSmiles("C#C")
>>> pat = Chem.MolFromSmarts("[$([CH1;A;X2;!R]#*)]")
>>> mol.GetSubstructMatches(pat)
((0,), (1,))

> Then if I search "CC#CC" using " [CH0;A;X2;!R]#[$(*)]",

I believe you want "[$([CH0;A;X2;!R]#*)]"

Thank you for your clear description of what you expected.

Cheers,

Andrew
da...@dalkescientific.com
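A tempting alternative to the recursive form is to leave the SMARTS alone and trim each match tuple to its first atom afterwards. The sketch below shows that post-processing step in plain Python, using only the match tuples printed in the sessions quoted above. The helper name first_atoms is made up for this note and is not part of RDKit.

```python
def first_atoms(matches):
    """Keep only the first atom index of each match, dropping repeats."""
    seen = []
    for match in matches:
        head = (match[0],)
        if head not in seen:
            seen.append(head)
    return tuple(seen)


# Match tuples copied from the sessions quoted in this thread.
allene_matches = ((1, 0, 2),)  # C=C=O with [CH0;A;X2;!R](=[$(*)])=[$(*)]
alkyne_matches = ((0, 1),)     # C#C with [CH1;A;X2;!R]#[$(*)]

print(first_atoms(allene_matches))  # ((1,),)
print(first_atoms(alkyne_matches))  # ((0,),)
```

For the allene this gives the same answer as the recursive SMARTS. For the alkyne it reports only atom 0, whereas the recursive form returned ((0,), (1,)), presumably because the two-atom search had already collapsed both orientations of the triple bond into a single tuple. That difference is a reason to prefer wrapping the whole pattern in $(...) when the goal is to count individual atoms.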
[Rdkit-discuss] SMARTS for =C=, #CH, #C-
Dear RDKitters,

I have a question regarding SMARTS codes for three simple functional groups: =C=, #CH and #C-. I am new to SMARTS/SMILES, so I tried to guess their codes. Here are my guesses:

=C= : [CH0;A;X2;!R](=[$(*)])=[$(*)]

#CH : [CH1;A;X2;!R]#[$(*)]

#C- : [CH0;A;X2;!R]#[$(*)]

I checked these SMARTS at http://smartsview.zbh.uni-hamburg.de/smartsview/calculate?method=get; they all seem to make sense.

For example, the webpage prints out the following messages:

=C=: it says "aliphatic C with 0 further total connections, with 0 further hydrogen, not in a ring".

#CH: "aliphatic C with 0 further total connections, with 1 further hydrogen, not in a ring".

#C-: "aliphatic C with 1 further total connections, with 0 further hydrogen, not in a ring".

However, when I search subgroups using these SMARTS, I had problems.

For example, if I search "C=C=O" using "[CH0;A;X2;!R](=[$(*)])=[$(*)]",
>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('C=C=O')
>>> m.GetSubstructMatches(Chem.MolFromSmarts("[CH0;A;X2;!R](=[$(*)])=[$(*)]"))
((1, 0, 2),)

it prints out atomic positions 1, 0, 2--three positions. But I would expect only one position for the Carbon in the middle.

Similarly, if I search "C#C" using "[CH1;A;X2;!R]#[$(*)]",
>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('C#C')
>>> m.GetSubstructMatches(Chem.MolFromSmarts("[CH1;A;X2;!R]#[$(*)]"))
((0, 1),)
I would expect two separate positions such as (0,), (1,), indicating there are two carbon triple bonds (with a hydrogen).

Then if I search "CC#CC" using " [CH0;A;X2;!R]#[$(*)]",
>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('CC#CC')
>>> m.GetSubstructMatches(Chem.MolFromSmarts(" [CH0;A;X2;!R]#[$(*)]"))
((1, 2),)
Again, I would expect two separate positions such as (1,), (2,), indicating two carbon triple bonds.

I think the problem might be that my SMARTS for these three groups are not SPECIFIC. I would appreciate everyone's help on this.

Cheers,
Chenyang
Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match
Brian, Greg, and David,

Thank you for your suggestions. I will try to respond to your questions and comments:

I am trying to reproduce results from a literature paper that used non-Python and non-RDKit code to identify certain patterns in molecules as part of a group contribution scheme resulting in the prediction of thermodynamic quantities. I have a training set of molecules and the results of calculations for that training set (individual counts of groups of atoms and resulting energies). Hence, my first goal is to reproduce the results reported for that training set, but using Python and RDKit.

Since my goal is to reproduce literature results as closely as possible, I am not in a position to debate the logic of the original authors in their assignments of SMARTS/SMILES matching and counts. After this initial goal is met, I might consider alternative pattern matching and counting schemes and compare those results to the literature results. In fact, that would be good science.

As I mentioned in my first email on this topic, I do think I have come up with a "rule" that will give me the correct answer (I have tried it for 8 cases using pencil and paper); my challenge is to code up the "rule" in Python. I am a beginner at Python, so I am struggling to get this idea into functional, bug-free code.

Peter Shenkin's idea/code is getting close to what needs to be done, but doesn't handle all the cases.

Regards,
Jim Metz

-Original Message-
From: Brian Cole
To: James T. Metz
Cc: RDKit Discuss
Sent: Tue, Nov 7, 2017 7:23 pm
Subject: Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match

You can use Chem.CanonicalRankAtoms to de-duplicate the SMARTS matches based upon the atom symmetry like this:

    from rdkit import Chem

    def count_unique_substructures(smiles, smarts):
        mol = Chem.MolFromSmiles(smiles)
        # With breakTies=False, symmetry-equivalent atoms share a rank.
        ranks = list(Chem.CanonicalRankAtoms(mol, breakTies=False))
        pattern = Chem.MolFromSmarts(smarts)

        # Two matches count as the same if they cover the same set of ranks.
        unique_sets_of_atoms = set()
        for match in mol.GetSubstructMatches(pattern):
            match_ranks = frozenset([ranks[idx] for idx in match])
            unique_sets_of_atoms.add(match_ranks)

        return len(unique_sets_of_atoms)

However, this returns 1 for each of your cases. It's not clear to me why you would want your 2nd case to return 2 as all paths from a chlorine to a chlorine through 2 carbons are symmetric.

>>> SMARTS = '[Cl]-[C,c]-,=,:[C,c]-[Cl]'
>>> smiles1 = 'ClC(Cl)CCl'
>>> smiles2 = 'ClC(Cl)C(Cl)(Cl)(Cl)'
>>> count_unique_substructures(smiles1, SMARTS)
1
>>> count_unique_substructures(smiles2, SMARTS)
1

-Brian

On Tue, Nov 7, 2017 at 7:38 PM, James T. Metz via Rdkit-discuss wrote:

RDKit Discussion Group,

I have written a SMARTS to detect vicinal chlorine groups using RDKit. There are 4 atoms involved in a vicinal chlorine group.

SMARTS = '[Cl]-[C,c]-,=,:[C,c]-[Cl]'

I am trying to count the number of ("unique") occurrences of this pattern. For some molecules with symmetry, this results in over-counting.

For the molecule, smiles1 below, I want to obtain a count of 1, i.e., 1 tuple of 4 atoms.

smiles1 = 'ClC(Cl)CCl'

However, using the SMARTS above, I obtain 2 tuples of 4 atoms. Beginning with a MOL file representation of smiles1, I get

((1,2,4,3), (0,2,4,3))

One possible solution is to somehow merge the two tuples according to a "rule." One rule that works is "if 3 of the atom indices are the same, then combine into one tuple."

However, the rule needs a bit of modification for more complicated cases (higher symmetry). Consider

smiles2 = 'ClC(Cl)CCl(Cl)(Cl)'

My goal is to get 2 tuples of 4 atoms for smiles2.

smiles2 is somewhat tricky because there are either 2 groups of 3 (4 atom) tuples, or 3 groups of 2 (4 atom) tuples, depending on how you choose your 3 atom indices.

Again, if my goal is to get 2 tuples, then I need to somehow pick the largest group, i.e., 2 groups of 3 tuples, to do the merge operation, which will give me 2 remaining groups (desired).

I have already checked stackoverflow and a few other places for Python code to do the necessary merging, but I could not find anything specific and appropriate.

I would be most grateful if anyone has ideas how to do this. I suspect the answer is a few lines of well-written Python code, and not modifying the SMARTS (I could be mistaken!).

Thank you.

Regards,
Jim Metz
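The de-duplication step in Brian's function can be looked at on its own, without RDKit. In the sketch below the symmetry classes are hand-assigned letters rather than the integers that Chem.CanonicalRankAtoms(mol, breakTies=False) would supply, the function name is made up, and the two match tuples are the ones Peter Shenkin gives elsewhere in this thread for 'ClC(Cl)CCl' numbered in SMILES order. Treat all three as assumptions for illustration.

```python
def count_unique_by_class(matches, classes):
    """Count matches that still differ once atoms are replaced by class."""
    unique = set()
    for match in matches:
        unique.add(frozenset(classes[idx] for idx in match))
    return len(unique)


# ClC(Cl)CCl, atoms numbered in SMILES order: Cl0 C1 Cl2 C3 Cl4.
# Hand-assigned classes: the two chlorines on C1 share a label.
classes = ["a", "b", "a", "c", "d"]
matches = ((0, 1, 3, 4), (2, 1, 3, 4))

print(count_unique_by_class(matches, classes))  # 1
```

Because each match is reduced to a set of classes, any two matches that differ only by swapping equivalent atoms collapse into one entry. That is also why this approach reports 1 for the more symmetric second molecule, as Brian notes.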
Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match
Peter,

Thank you for your suggestions and accompanying code. I have modified your code slightly and have created 3 tuples for testing.

Your code works for tuples match1 and match2, but does not work for match3. The code should return a 2 for match3, because there are 2 sets of 3 tuples, each containing 4 atom indices.

Using my "rule" that "if 3 indices are the same, they are in one group, and one must form the groups of the largest possible size", one arrives at 2 groups. The merge function should then select one tuple from each group, resulting in a count of 2 (for the final number of groups).

Keep in mind that I will not know how many groups of tuples will be created for any given molecule. Hence, I cannot use hard-coded array indices.

Any ideas how to modify the code below to obtain the desired result for tuple match3, and how to deal with tuples of various sizes?

Regards,
Jim Metz

    def merge2(matches):
        if len(matches) > 1:
            d = {}
            for match in matches:
                t = (matches[0], matches[1])
                if (matches[0] < matches[1]):
                    t = (matches[0], matches[1])
                else:
                    t = (matches[1], matches[0])
                d[t] = match
            merged_match = (d[t],)
        else:
            merged_match = matches
        count = len(merged_match)
        return(count)

    match1 = ((0,2,3,4),)
    match2 = ((0,2,3,4), (1,2,3,4))
    match3 = ((0,2,4,5), (1,2,5,6), (2,3,4,5), (2,3,5,6), (0,2,5,6), (1,2,4,5))

    matches = match2  # Change the number to test different tuples
    output = merge2(matches)
    print("Output is ", output)

-Original Message-
From: Peter S. Shenkin
To: James T. Metz
Cc: RDKit Discuss
Sent: Tue, Nov 7, 2017 7:05 pm
Subject: Re: [Rdkit-discuss] Python code to merge tuples from a SMARTS match

I think you probably used a slightly different SMILES than the one you showed. The one you showed should have given ((0,1,3,4),(2,1,3,4)).

The proper merge rule would then be to consider all matches equivalent if the 2nd and 3rd atom in the match agree, in any order; i.e., the two carbons, indices 1 and 3 in this case.

So to do this, for each molecule, do something like this:

    d = {}
    for match in matches:
        # Key each match on its two central carbons, smaller index first.
        if match[1] < match[2]:
            t = (match[1], match[2])
        else:
            t = (match[2], match[1])
        d[t] = match

You will wind up with as many dictionary elements as there are unique matches, one per pair of carbons.

-P.
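One way to turn the "3 of 4 indices in common, largest groups first" rule described in this message into code is a greedy grouping. The sketch below is only one reading of that rule, checked against the three test tuples given above and nothing else. The function name is made up, and a greedy choice is an assumption that may not give the smallest number of groups for every possible input.

```python
from itertools import combinations


def count_merged_groups(matches):
    """Greedy sketch: merge matches that share all but one atom index."""
    remaining = {tuple(sorted(m)) for m in matches}
    groups = 0
    while remaining:
        # Bucket the unassigned matches by every (n-1)-index subset they hold.
        buckets = {}
        for m in remaining:
            for key in combinations(m, len(m) - 1):
                buckets.setdefault(key, set()).add(m)
        # Take the largest bucket; sorting the keys makes ties deterministic.
        best_key = max(sorted(buckets), key=lambda k: len(buckets[k]))
        remaining -= buckets[best_key]
        groups += 1
    return groups


match1 = ((0, 2, 3, 4),)
match2 = ((0, 2, 3, 4), (1, 2, 3, 4))
match3 = ((0, 2, 4, 5), (1, 2, 5, 6), (2, 3, 4, 5),
          (2, 3, 5, 6), (0, 2, 5, 6), (1, 2, 4, 5))

print(count_merged_groups(match1))  # 1
print(count_merged_groups(match2))  # 1
print(count_merged_groups(match3))  # 2
```

For match3 the two largest buckets are the tuples sharing indices (2, 4, 5) and those sharing (2, 5, 6), three tuples each, which gives the two groups asked for. No array indices are hard-coded, so the same function accepts any number of match tuples.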
[Rdkit-discuss] RPM distros
There is mention of RPM distributions of RDKit (https://copr.fedorainfracloud.org/coprs/giallu/rdkit/). But on trying these:

1. the distro is based on the 2017_03_1 release
2. it fails due to a missing libinchi.so.1 dependency.

This is presumably no longer being maintained? Is there anything that can be done to help with fixing this?

Tim