Greg-

All of your suggestions worked great and RDKit did an amazing job of creating nice 3D conformations... quickly too!

Thank you for all your helpful support and great work on RDKit.

-Marshall

Here is the the script for anyone looking to perform a similar task:

########## rdkit_solution.py #############
import Chem
from Chem import AllChem

# load SDF into supplier
supplier = Chem.SDMolSupplier('usdrug.separate.sdf')

# open bad molecule logfile
badmollog = open('bad_mols.log','w')
converted = open('converted.sdf','w')

def embedsubroutine(mol):
        num = mol.GetNumAtoms()
        if num > 500:
        #embed with randcoords
embedresult = AllChem.EmbedMolecule(mol,useRandomCoords=True)
                if embedresult == 0:
                        return embedresult

        #embed normally
        embedresult = AllChem.EmbedMolecule(mol)
        return embedresult

def optimizesubroutine(mol):
        optimizeresult = AllChem.UFFOptimizeMolecule(mol,1000)
        return optimizeresult

count = 0
# for loop to go through the supplier
for i,mol in enumerate(supplier):
        count+=1
        if mol is None:
                print 'BAD MOLECULE: ',count
                badmollog.write(supplier.GetItemText(i))
                continue
        molH = Chem.AddHs(mol)
        if embedsubroutine(molH) == 0:
                print 'Embeded, now optimizing: ',count
                if optimizesubroutine(molH)==0:
                        converted.write(Chem.MolToMolBlock(molH))
                        converted.write('\n$$$$\n')
                else:
                        print 'Failure with optimzation',count
                        badmollog.write(supplier.GetItemText(i))
        else:
                print 'Due failed embed, cannot optimze',count
                badmollog.write(supplier.GetItemText(i))
                continue
#end of for_loop

badmollog.close()
converted.close()
exit

######### END OF FILE  ##########

On Mar 8, 2009, at 10:36 PM, Greg Landrum wrote:

Dear Marshall,

On Sat, Mar 7, 2009 at 5:26 AM, Marshall Levesque
<[email protected]> wrote:

Your suggestion fixed my problems with the file I was working with. I am a
novice with both to RDKit and Python.

Glad to hear that the script fixed at least part of the problems.


I have a new question for you: I am using the model script you provided,
but with a different set of compounds.

I will reorder your remaining two questions since the first has an
easier (and more fundamental) answer.

AND THEN, I tried to stop using the UFFOptimizeMolecule function and use
only EmbedMolecule and I received this error:

############
embed: 613
embed: 614
embed: 615
embed: 616
embed: 617
embed: 618
[20:03:38] Explicit valence for atom # 0 N greater than permitted
[20:03:38] Unexpected error hit on line 38532
[20:03:38] ERROR: moving to the begining of the next molecule
embed: 619
Traceback (most recent call last):
File "2-3D.py", line 10, in <module>
  AllChem.EmbedMolecule(mol)
Boost.Python.ArgumentError: Python argument types in
  Chem.rdDistGeom.EmbedMolecule(NoneType)
did not match C++ signature:
EmbedMolecule(RDKit::ROMol {lvalue} mol, unsigned int maxAttempts=0, int randomSeed=-1, bool clearConfs=True, bool useRandomCoords=False, double
boxSizeMult=2.0, bool randNegEig=True, unsigned int numZeroFail=1,
boost::python::dict {lvalue} coordMap={})

This type of error message, the Boost.Python.ArgumentError where you have:

Boost.Python.ArgumentError: Python argument types in
  SOME_FUNCTION(NoneType)
did not match C++ signature:
  SOME_FUNCTION(RDKit::ROMol {lvalue} mol, OTHER_ARGUMENTS)

Is the RDKit/Python way of telling you that you have a molecule that
the RDKit either failed to process or rejected.

I think this is the #618 and #619 compound(s) that caused the problem:
http://www.emolecules.com/cgi-bin/more?vid=5753883
http://www.emolecules.com/cgi-bin/more?vid=15920212

The first molecule looks ok to me (and I can read it in from the
SMILES given on your web page). The second has a neutral
four-coordinate nitrogen, so it's going to be rejected by the RDKit,
which is very picky about valence.

If there are valence problems, I can just record those errors and throw out those molecules. But I do not understand why the FileParseException does not succeed in proceeding to the next molecule. I am assuming it has to do with the way the for-loop is constructed, but I wasn't able to figure it
out.

The way to skip molecules the RDKit considers "bad" is to know that
the molecule processing machinery returns the special value None when
it encounters a molecule it's unhappy with. You can test for this in
your for loop:

####
for i,mol in enumerate(supplier):
 if mol is None:
   # record or report error, supplier.GetItemText(i) might be useful
   continue
 # do whatever else you want to do with the molecule
 mols.append(mol)
####

I included two placeholders here for your code. The important part is
the test to see if the molecule is None and the continue, which causes
the rest of the loop body to be skipped for that molecule.


##### SCRIPT to embed and optimize  #########
import Chem
from Chem import AllChem
suppl = Chem.SDMolSupplier('usdrug.700.sdf')

mols = []
for i,mol in enumerate(suppl):
      num = mol.GetNumAtoms()
       print 'embed:',i
       AllChem.EmbedMolecule(mol)
      print 'optimize:',i
      AllChem.UFFOptimizeMolecule(mol,500)
       mols.append(mol)

w = Chem.SDWriter('output.sdf')
for mol in mols: w.write(mol)

##########################

###### OUTPUT error ########
embed: 529
optimize: 529
embed: 530
optimize: 530
embed: 531
optimize: 531
embed: 532
optimize: 532
Traceback (most recent call last):
 File "2-3D.py", line 12, in <module>
   AllChem.UFFOptimizeMolecule(mol,500)
ValueError: Bad Conformer Id

###############################

I think I narrowed the molecule down to either of these two (with the first
looking more likely to be the problem):
http://www.emolecules.com/cgi-bin/more?vid=5852064
http://www.emolecules.com/cgi-bin/more?vid=1985607

What is happening here is that the call to EmbedMolecule is failing to
generate a conformation. This can happen with very large, flexible
molecules like your two examples. The way you check this is to test
the return value from EmbedMolecule(), which returns -1 if it fails to
generate a conformation.

In cases where the standard embedding algorithm fails (returns -1),
you can try calling EmbedMolecule again with the useRandomCoords
argument set to True:

####
[11]>>> m = Chem.MolFromSmiles('O=C(o...@h]1o[c@H] (COC(=O)c2cc(O)c(O)c(OC(=O)c3cc(O)c(O)c(O)c3)c2)[C@@H] (OC(=O)c2cc(O)c(O)c(OC(=O)c3cc(O)c(O)c(O)c3)c2)[...@h] (OC(=O)c2cc(O)c(O)c(OC(=O)c3cc(O)c(O)c(O)c3)c2) [C @H ]1OC (= O )c1cc (O )c (O )c (OC(=O)c2cc(O)c(O)c(O)c2)c1)c1cc(O)c(O)c(OC(=O)c2cc(O)c(O)c(O)c2)c1')

[12]>>> AllChem.EmbedMolecule(m)
Out[12] -1      # <--- indicates failure

[13]>>> AllChem.EmbedMolecule(m,useRandomCoords=True)
Out[13] 0      # <---- that time it worked
####

I will leave it to you to set this up in your code.

Note 1: since the useRandomCoords method is more robust (and faster)
for larger molecules, it might be worth thinking about testing the
size of the molecule (via mol.GetNumAtoms()) and automatically using
useRandomCoords for larger systems.

Note 2: Now that I look at your code with my "chemistry" eyes on
(instead of my "python help" eyes), I notice that you are generating
the conformations without Hs on the molecules. This sometimes will
give you geometries that aren't particularly happy. You might want to
consider using Chem.AddHs before you generate a conformation and then
calling Chem.RemoveHs before you write the molecule back out (assuming
you don't want the Hs in the output SD file). Note that both
Chem.AddHs and Chem.RemoveHs return new molecules.

Best Regards,
-greg


Reply via email to