On Oct 7, 2015, at 11:30 AM, Christos Kannas wrote: > Yes there is an easier way, by using substructure search, i.e. do a > substructure search for [C] and then get the number of matches.
> m = Chem.MolFromSmiles("CCCCCCCCc1ccccc1") > patt= Chem.MolFromSmarts("[C]") > pm = m.GetSubstructMatches(patt) > print len(pm) > 8 You'll notice that this returned 8 when Joos Kiener wanted the count of all the C-Atoms. I assume that means all 14 carbon atoms, and not only the 8 aliphatic carbon atoms, so the SMARTS should be tweaked somewhat: >>> from rdkit import Chem >>> mol = Chem.MolFromSmiles("CCCCCCCCc1ccccc1") >>> pat = Chem.MolFromSmarts("[#6]") >>> len(mol.GetSubstructMatches(pat)) 14 On the topic of counting carbons given a molecule, there are two general approaches - the SMARTS pattern, and atom iteration, though it's better to count the number of atomic number matches rather than use the symbol: from rdkit import Chem pat = Chem.MolFromSmarts("[#6]") def count1(mol): return len(mol.GetSubstructMatches(pat)) def count2(mol): return sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 6) It's possible to time these two functions, to see which has the better performance: import time def go(count, mol, n=1000): _range = range(n) t1 = time.time() m = mol for x in _range: c = count(mol) t2 = time.time() return c, t2-t1 With that timing scaffold in place: >>> print go(count1, mol) (8, 0.020318031311035156) >>> print go(count2, mol) (8, 0.07634186744689941) This makes it seem like the SMARTS match solution is the best. However, there are two lurking details. First, when the structure has more than ~256 carbons, the iterative solution is faster: >>> for i in range(10): ... mol = Chem.MolFromSmiles("C" * (2**i)) ... c1, t1 = go(count1, mol) ... c2, t2 = go(count2, mol) ... assert c1 == c2 ... print 2**i, t1, t2 ... 1 0.00930690765381 0.0515530109406 2 0.00659918785095 0.0535910129547 4 0.00970101356506 0.0570240020752 8 0.0159358978271 0.0636560916901 16 0.0273370742798 0.0776889324188 32 0.0527520179749 0.103546857834 64 0.107499837875 0.159503221512 128 0.218550920486 0.266411066055 256 0.513395786285 0.4887611866 512 1.53533577919 0.901851892471 Second, by default there is a limit to the number of SMARTS matches: >>> mol = Chem.MolFromSmiles("C" * 999) >>> count1(mol) 999 >>> mol = Chem.MolFromSmiles("C" * 1000) >>> count1(mol) 1000 >>> mol = Chem.MolFromSmiles("C" * 1001) >>> count1(mol) 1000 Quoting from the documentation: In high-symmetry cases with medium-sized molecules, it is very easy to end up with a combinatorial explosion in the number of possible matches. This argument prevents that from having unintended consequences Instead, this is a better general-purpose SMART matcher count function: def count3(mol): return len(mol.GetSubstructMatches(pat, maxMatches=mol.GetNumAtoms())) >>> count3(mol) 1001 Andrew da...@dalkescientific.com ------------------------------------------------------------------------------ Full-scale, agent-less Infrastructure Monitoring from a single dashboard Integrate with 40+ ManageEngine ITSM Solutions for complete visibility Physical-Virtual-Cloud Infrastructure monitoring from one console Real user monitoring with APM Insights and performance trend reports Learn More http://pubads.g.doubleclick.net/gampad/clk?id=247754911&iu=/4140 _______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss