On Oct 7, 2015, at 11:38 PM, Ling Chan wrote:
> Or you can use AllChem.CalcMolFormula() to get the chemical formula.

Well spotted! It's a bit tricky because it needs to handle carbons
with/without a count ("CH4", "C2H6"), and structures with no carbons
("P", "Ca", "Cd"); the last two start with a C but aren't carbon.

Here's my code for it; it took me a couple of iterations to get right:

  import re
  from rdkit.Chem import AllChem

  _carbon_pat = re.compile(r"""
  C[a-z]|        # Ignore non-carbon formulas that start with "C", like "Ca"
  (
    C            # First character must be a 'C'
    ([0-9]*)     # followed by optional count
  )""", re.X)

  def count4(mol):
      # Count carbons by parsing the start of the molecular formula
      formula = AllChem.CalcMolFormula(mol)
      m = _carbon_pat.match(formula)
      if m is None:
          return 0   # "P"
      if m.group(1) is not None:
          count = m.group(2)
          if count:
              return int(count)  # "C2H6"
          return 1   # "CH4"
      return 0  # "Cd"

But what about timings? For a simple SMILES it's about 2x faster than
the SMARTS-based version:

  == CCCCCCCCc1ccccc1 ==
  count2 (14, 0.11608195304870605)   <-- count element 6
  count3 (14, 0.027531147003173828)  <-- use a SMARTS
  count4 (14, 0.011680126190185547)  <-- use the molecular formula

(The output shows the count and the average time in milliseconds.)

For a long SMILES it's over an order of magnitude faster:

  == "[C]"*1500 ==
  count2 (1500, 4.046566963195801)
  count3 (1500, 16.144488096237183)
  count4 (1500, 0.2724030017852783)

The worst-case scenario is a large structure with no carbons:

  == "[Cd]"*1500 ==
  count2 (0, 3.7225089073181152)
  count3 (0, 0.09242105484008789)
  count4 (0, 0.2670719623565674)

That case is unrealistic, and even then count4 is only about 3x slower
than the SMARTS match code.

I fed it some ChEBI structures and found the molecular formula approach
was about 2x faster than the SMARTS match, and almost 10x faster than
iterating over the atoms:

  count2 188.5 ms (average of 1000)
  count3 40.9 ms
  count4 20.8 ms

On the other hand, the code is more complex.

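
Since the tricky part is the pattern rather than the RDKit call, it can be checked on plain formula strings without a toolkit. This is only a sketch: carbon_count_from_formula is a hypothetical helper name (not part of RDKit), and it assumes, as count4 does, that carbon comes first in the formula whenever it is present.

```python
import re

_carbon_pat = re.compile(r"""
C[a-z]|        # a non-carbon element that starts with "C", like "Ca" or "Cd"
(
  C            # a carbon
  ([0-9]*)     # followed by an optional count
)""", re.X)

def carbon_count_from_formula(formula):
    # Hypothetical helper: same logic as count4, but takes the formula string.
    m = _carbon_pat.match(formula)
    if m is None:
        return 0          # does not start with "C" at all, e.g. "P"
    if m.group(1) is None:
        return 0          # matched the "C[a-z]" branch, e.g. "Cd"
    count = m.group(2)
    return int(count) if count else 1   # "C2H6" has a count, "CH4" does not

for formula in ("CH4", "C2H6", "P", "Ca", "Cd", "C1500"):
    print(formula, carbon_count_from_formula(formula))
```

The two-branch pattern is what keeps "Ca" and "Cd" from being read as one carbon: the first alternative consumes them without setting group 1.
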
Andrew
da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss