On Oct 7, 2015, at 11:38 PM, Ling Chan wrote:
> Or you can use AllChem.CalcMolFormula() to get the chemical formula.

Well spotted! Parsing the formula is a bit tricky because the code needs to 
handle carbons with and without a count ("CH4", "C2H6"), and structures with 
no carbons ("P", "Ca", "Cd"); the last two start with a C but aren't carbon.

Here's my code for it; it took me a couple of iterations to get right:

import re
from rdkit.Chem import AllChem

_carbon_pat = re.compile(r"""
  C[a-z]|     # Ignore non-carbon formulas that start with "C", like "Ca"
  (
    C         # First character must be a 'C'
    ([0-9]*)  # followed by optional count
  )""", re.X)


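def count4(mol):
    # Count carbons by parsing the molecular formula; this relies on
    # carbon being listed first whenever the structure contains any.
    formula = AllChem.CalcMolFormula(mol)
    m = _carbon_pat.match(formula)
    if m is None:
        return 0  # no leading "C" at all, e.g. "P"
    if m.group(1) is not None:
        count = m.group(2)
        if count:
            return int(count)  # explicit count, e.g. "C2H6"
        return 1  # bare "C" means one carbon, e.g. "CH4"
    return 0  # matched the "C[a-z]" branch, e.g. "Cd"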
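
The matching logic can also be checked on plain formula strings, without
RDKit installed. This is only a minimal sketch: carbon_count_from_formula is
a hypothetical helper name (not part of RDKit), and it assumes the formula
string starts with carbon whenever carbon is present, as in the "CH4" and
"C2H6" examples above.

```python
import re

# Same pattern as count4 uses: the "C[a-z]" branch consumes two-letter
# elements such as "Ca" or "Cd" so they are not mistaken for carbon.
_carbon_pat = re.compile(r"""
  C[a-z]|     # two-letter element starting with "C"; not carbon
  (
    C         # a carbon symbol
    ([0-9]*)  # followed by an optional count
  )""", re.X)


def carbon_count_from_formula(formula):
    # Hypothetical helper: takes a formula string instead of a molecule.
    m = _carbon_pat.match(formula)
    if m is None or m.group(1) is None:
        return 0  # no leading "C", or a two-letter element like "Cd"
    return int(m.group(2) or 1)  # an empty count means a single carbon


for formula in ["CH4", "C2H6", "C14H22", "P", "Ca", "Cd"]:
    print(formula, carbon_count_from_formula(formula))
```

Folding the two "return 0" cases into one test shortens the function a
little, at the cost of no longer showing which branch of the pattern matched.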

But what about timings?

For a simple SMILES it's about 2x faster than the SMARTS-based version:

== CCCCCCCCc1ccccc1 ==
count2 (14, 0.11608195304870605)   <-- count element 6
count3 (14, 0.027531147003173828)  <-- use a SMARTS
count4 (14, 0.011680126190185547)  <-- use the molecular formula

(The output shows the count and the average time in milliseconds.)
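
The timing harness itself isn't included in this message. One way such
(result, average milliseconds) pairs could be collected is sketched below.
time_call is a hypothetical helper, and the choice of time.perf_counter and
the repeat count are assumptions, not necessarily what produced the numbers
above.

```python
import time


def time_call(func, arg, repeat=1000):
    # Call func(arg) `repeat` times; return the last result and the
    # average wall-clock time per call in milliseconds.
    start = time.perf_counter()
    for _ in range(repeat):
        result = func(arg)
    elapsed = time.perf_counter() - start
    return result, elapsed * 1000.0 / repeat


# Stand-in workload so the sketch runs without RDKit; with RDKit available,
# pass a counting function such as count4 and a molecule instead.
result, avg_ms = time_call(sorted, [3, 1, 2])
print(result, avg_ms)
```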

For a long SMILES it's over an order of magnitude faster than either 
alternative:

== "[C]"*1500 ==
count2 (1500, 4.046566963195801)
count3 (1500, 16.144488096237183)
count4 (1500, 0.2724030017852783)

The worst-case scenario is a large structure with no carbons:

== "[Cd]"*1500 ==
count2 (0, 3.7225089073181152)
count3 (0, 0.09242105484008789)
count4 (0, 0.2670719623565674)

This case is unrealistic, and even then the formula version is only about 3x 
slower than the SMARTS match code.

I fed it some ChEBI structures and found the molecular formula approach was 
about 2x faster than the SMARTS match, and almost 10x faster than iterating 
over the atoms:

count2 188.5 ms  (average of 1000)
count3  40.9 ms
count4  20.8 ms


On the other hand, the code is more complex.


                                Andrew
                                da...@dalkescientific.com



_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
