James,

On Fri, Sep 30, 2011 at 8:48 AM, James Davidson <j.david...@vernalis.com> wrote:
>
> Greg wrote:
>> You actually don't need to add the Hs:
>> >>> p1 = Chem.MolFromSmarts('[#7,#8;H1]')
>> >>> p2 = Chem.MolFromSmarts('[#7,#8;H2]')
>> >>> p3 = Chem.MolFromSmarts('[#7,#8;H3]')
>> >>> m = Chem.MolFromSmiles('CC(=O)N')
>> >>> m2 = Chem.MolFromSmiles('OCC(=O)N')
>> >>> def NHOHCount(mol): return len(mol.GetSubstructMatches(p1))+2*len(mol.GetSubstructMatches(p2))+3*len(mol.GetSubstructMatches(p3))
>> ...
>> >>> NHOHCount(m)
>> 2
>> >>> NHOHCount(m2)
>> 3
>
> I think this system works well in almost all cases : ) However, I had a
> nagging concern over a couple of 'edge' cases - namely water, and
> ammonia (and for that matter, the oxonium and ammonium ions).

You're exactly right. I showed the SMARTS-based version as a simple
illustration. The version that's actually checked in is using a
different method (it loops over all O and N atoms and counts the number
of Hs connected to each).

> I guess the simple inclusion of P4 = Chem.MolFromSmarts('[#8;H4]') would
> make sure all cases were covered(?).
>
> Out of interest, I decided to compile a small list of 'normal' and
> 'edge' case SMILES, and ran it through the MOE descriptor node in KNIME.
> For all these cases, lip_don behaves as I would expect (tab-separated
> output included below)

Some comments on this below.

>
> "SMILES"      "a_acc"  "a_don"  "lip_acc"  "lip_don"
> "CO"          1.0      1.0      1.0        1.0
> "C(=O)N"      1.0      1.0      2.0        2.0
> "O"           1.0      1.0      1.0        2.0
> "CN"          1.0      1.0      1.0        2.0
> "[O+]"        1.0      0.0      1.0        3.0
> "C[O+]"       1.0      0.0      1.0        2.0
> "[N+]"        0.0      0.0      1.0        4.0
> "C[N+]"       0.0      0.0      1.0        3.0
> "[N-]"        0.0      1.0      1.0        2.0
> "[O-]"        0.0      1.0      1.0        1.0
> "C(=O)[N-]"   0.0      1.0      2.0        1.0

For what it's worth: the results here are definitely not correct for the
SMILES as provided. Atoms in SMILES that are in square brackets have no
implicit Hs, so [N+] actually has zero hydrogens. I guess you actually
provided the molecules to MOE in some other form.
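
To make the bracket-atom point concrete, here is a minimal sketch of the
atom-loop style of counting described above. It is not the checked-in code:
`count_nh_oh` is a hypothetical helper name used only for this example, and
it assumes nothing beyond `Atom.GetAtomicNum()` and `Atom.GetTotalNumHs()`.

```python
from rdkit import Chem


def count_nh_oh(mol):
    """Sum the hydrogens (implicit + explicit) on every N and O atom.

    Hypothetical helper for illustration; not the RDKit implementation.
    """
    return sum(atom.GetTotalNumHs()
               for atom in mol.GetAtoms()
               if atom.GetAtomicNum() in (7, 8))


# Bracket atoms carry only the hydrogens written inside the brackets,
# so '[N+]' and '[NH4+]' are different species.
for smiles in ('[N+]', '[NH4+]', 'CN', 'O'):
    mol = Chem.MolFromSmiles(smiles)
    print('%s %d' % (smiles, count_nh_oh(mol)))
```

Under these assumptions, `[N+]` should report zero hydrogens and `[NH4+]`
four, while the unbracketed `CN` and `O` pick up their implicit Hs as usual.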

Sample script using your data (with corrected SMILES):

# -------------------
from rdkit import Chem
from rdkit.Chem import Lipinski

# Each row: SMILES, then the MOE values a_acc, a_don, lip_acc, lip_don
d = [
    ["CO", 1.0, 1.0, 1.0, 1.0],
    ["C(=O)N", 1.0, 1.0, 2.0, 2.0],
    ["O", 1.0, 1.0, 1.0, 2.0],
    ["CN", 1.0, 1.0, 1.0, 2.0],
    ["[OH3+]", 1.0, 0.0, 1.0, 3.0],
    ["C[OH2+]", 1.0, 0.0, 1.0, 2.0],
    ["[NH4+]", 0.0, 0.0, 1.0, 4.0],
    ["C[NH3+]", 0.0, 0.0, 1.0, 3.0],
    ["[NH2-]", 0.0, 1.0, 1.0, 2.0],
    ["[OH-]", 0.0, 1.0, 1.0, 1.0],
    ["C(=O)[NH-]", 0.0, 1.0, 2.0, 1.0],
]

print('Smiles NOCount NHOHCount')
for row in d:
    m = Chem.MolFromSmiles(row[0])
    hba = Lipinski.NOCount(m)    # number of N and O atoms
    hbd = Lipinski.NHOHCount(m)  # number of Hs attached to N and O atoms
    print('%s %d %d' % (row[0], hba, hbd))
#-----------------------------------

Output with the SVN version of the RDKit:

#------------------
Smiles NOCount NHOHCount
CO 1 1
C(=O)N 2 2
O 1 2
CN 1 2
[OH3+] 1 3
C[OH2+] 1 2
[NH4+] 1 4
C[NH3+] 1 3
[NH2-] 1 2
[OH-] 1 1
C(=O)[NH-] 2 1
#-----------------

Best,
-greg