Re: [Rdkit-discuss] RDKit - DbCLI

Robert DeLisle Thu, 20 Nov 2008 23:38:50 +0000

Greg,

After running through the process with exception handling in place I was
able to isolate 10 structures that were being problematic.  All of them had
at least one bond designated as 0 order in the SD file - much as you found
for some of the other structures previously.  I assume that these passed the
initial import step but are failing upon descriptor generation for obvious
reasons.
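
For illustration, a minimal sketch of how such records could be flagged before loading, without RDKit. It assumes V2000 mol blocks with the standard fixed-width counts and bond lines; the helper names (`iter_sd_records`, `zero_order_bonds`) and the sample data are hypothetical and are not part of CreateDB.py or RDKit.

```python
def iter_sd_records(sd_text):
    """Yield each SD record as a list of lines (the '$$$$' line is dropped)."""
    record = []
    for line in sd_text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)
    if any(l.strip() for l in record):
        yield record  # trailing record without a terminator


def zero_order_bonds(record):
    """Return (atom1, atom2) pairs for bonds whose type field is 0.

    Assumes a V2000 mol block: three header lines, a counts line, the atom
    block, then the bond block, all using 3-character fixed-width fields.
    """
    if len(record) < 4:
        return []
    counts = record[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    first_bond = 4 + n_atoms
    bad = []
    for line in record[first_bond:first_bond + n_bonds]:
        if int(line[6:9]) == 0:
            bad.append((int(line[0:3]), int(line[3:6])))
    return bad


# Two hand-written records (made-up data): the second has a bond of type 0.
SAMPLE = "\n".join([
    "mol_ok", "  sketch", "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0",
    "    1.5000    0.0000    0.0000 C   0  0",
    "  1  2  1  0",
    "M  END", "$$$$",
    "mol_bad", "  sketch", "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0",
    "    1.5000    0.0000    0.0000 O   0  0",
    "  1  2  0  0",
    "M  END", "$$$$",
])

flagged = [(rec[0].strip(), zero_order_bonds(rec))
           for rec in iter_sd_records(SAMPLE)]
flagged = [(name, bonds) for name, bonds in flagged if bonds]
print(flagged)
```

A pre-scan like this only reports the record name and the atom pair; whether to drop or repair the record is a separate decision.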


I suppose the only request that I have is for more graceful error handling.
I've attached my (admittedly sloppy) version of CreateDB.py showing what I
did to isolate the errors.
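
One possible shape for that kind of error handling, as a sketch only: each per-molecule step runs through a small wrapper that records the id, the step name, and the exception text instead of stopping the run. `FailureLog`, `attempt`, and `fake_descriptors` are hypothetical names; the stand-in function simulates a failing calculation and is not an RDKit call.

```python
class FailureLog(object):
    """Collects (mol_id, step_name, message) for every step that raises."""

    def __init__(self):
        self.failures = []

    def attempt(self, mol_id, step_name, func, *args, **kwargs):
        """Return (True, result), or (False, None) after recording the error."""
        try:
            return True, func(*args, **kwargs)
        except Exception as exc:
            message = "%s: %s" % (type(exc).__name__, exc)
            self.failures.append((mol_id, step_name, message))
            return False, None


def fake_descriptors(mol):
    # Stand-in for a real descriptor calculation (assumption for the demo).
    if mol == "bad":
        raise RuntimeError("Unknown exception")
    return [1.0, 2.0]


flog = FailureLog()
results = {}
for mol_id, mol in [("cmpd-1", "ok"), ("cmpd-2", "bad")]:
    ok, descrs = flog.attempt(mol_id, "descriptors", fake_descriptors, mol)
    if ok:
        results[mol_id] = descrs

for failure in flog.failures:
    print("%s failed at %s (%s)" % failure)
```

Returning an explicit success flag keeps a legitimate `None` result distinguishable from a failure, and the collected list can be written to a file once at the end of the run.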

-Kirk





On Thu, Nov 20, 2008 at 1:33 PM, <rkdeli...@gmail.com> wrote:

> Indeed I can. Luckily I had a console window open with the error in place
> just as I saw your message:
>
>
>
> [13:21:16] INFO: Done: 54500
> Traceback (most recent call last):
>   File "C:\RDKit_Q32008_1\Projects\dbcli\CreateDB.py", line 222, in <module>
>     mol = Chem.Mol(str(pkl))
> RuntimeError: Unknown exception
>
>
> I've just wrapped this one in a try-catch block as well.
>
>
>
>
> On Nov 20, 2008 1:17pm, Greg Landrum <greg.land...@gmail.com> wrote:
> > Can you send me the console output without disclosing things you
> > oughtn't to disclose?
> >
> > FYI: the deprecation warnings ought not to be causing the problem.
> > There ought to be a bug report filed against this already, but it
> > looks like I forgot to submit it. grn.
> >
> > -greg
> >
> > On Thu, Nov 20, 2008 at 9:06 PM,   wrote:
> > > Greg,
> > >
> > > Thanks for the quick response.
> > >
> > > In reading my original question I realize I didn't explain myself well.
> > > Sorry about that. 8^)
> > >
> > > I'm trying to set up a database of ~100,000 structures which will be
> > > queried by very few structures at a time. While running CreateDB.py I
> > > get to the step that gives an output of:
> > >
> > > 'Generating fingerprints and descriptors:'
> > >
> > > In reading the output more closely I see that there are some deprecation
> > > warnings that mention a distance matrix - that's where my original
> > > question regarding a pairwise computation step came from. Regardless,
> > > after around 50,000 structures, I get a 'Runtime: unexpected exception'
> > > message and Python stops. Having done a bit more research I see that each
> > > molecule is passed through Atom Pair, Fingerprint, and Descriptor
> > > generation. I assume it is failing somewhere within those steps, but I
> > > haven't yet identified where or why. I have just wrapped all of those
> > > procedures in try-catch blocks in hopes of finding the offending
> > > structure. Once I have it, I'll do some tests on it and send it your way.
> > >
> > > -Kirk
> > >
> > > On Nov 20, 2008 12:41pm, Greg Landrum wrote:
> > > >> [moving a general-interest question to the mailing list]
> > > >>
> > > >> Hi Kirk,
> > > >>
> > > >> On Thu, Nov 20, 2008 at 6:03 PM,   wrote:
> > > >> >
> > > >> > I have another question on DbCLI. After getting rid of problematic
> > > >> > structures, I was able to get DbCLI to the pairwise comparison step,
> > > >> > but my
> > > >>
> > > >> I'm not sure what the "pairwise comparison step" is with the DbCLI stuff.
> > > >> Step one is loading the database with CreateDb.py, step 2 is doing
> > > >> searches with SearchDb.py. What are you asking about?
> > > >>
> > > >> > dataset has on the order of 100,000 structures. After about 50,000
> > > >> > structures Python issued an "Unexpected error" response and stopped.
> > > >> > Is this likely due to the enormous size of a pairwise distance table
> > > >> > for this dataset? Have you had problems with very large datasets in
> > > >> > the past or has this typically worked smoothly?
> > > >>
> > > >> I must admit that I've never queried with that number of structures.
> > > >> My typical use case is to have a large database (10^5-10^6 compounds)
> > > >> and query that with a few (~10) structures. The code hasn't really
> > > >> been written to deal with giant query sets. That is doable, but it
> > > >> would require some reworking. Probably the best bet would be to
> > > >> support loading the queries from a database as well; that way you
> > > >> wouldn't have to reprocess the queries every time and could pretty
> > > >> easily handle the "only loading a few at a time" problem.
> > > >>
> > > >> It's an interesting thing to think about.
> > > >>
> > > >> -greg
> > > >>
>

# $Id: CreateDb.py 665 2008-05-15 04:33:40Z glandrum $
#
#  Copyright (c) 2007, Novartis Institutes for BioMedical Research Inc.
#  All rights reserved.
# 
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met: 
#
#     * Redistributions of source code must retain the above copyright 
#       notice, this list of conditions and the following disclaimer.
#     * Redistributions in binary form must reproduce the above
#       copyright notice, this list of conditions and the following 
#       disclaimer in the documentation and/or other materials provided 
#       with the distribution.
#     * Neither the name of Novartis Institutes for BioMedical Research Inc. 
#       nor the names of its contributors may be used to endorse or promote 
#       products derived from this software without specific prior written
#       permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# Created by Greg Landrum, July 2007
_version = "0.7.2"
_usage="""
 CreateDb [optional arguments] <filename>

  NOTES:

    - the property names for the database are the union of those for
      all molecules.

    - missing property values will be set to 'N/A', though this can be
      changed with the --missingPropertyVal argument.
    
    - The property names may be altered on loading the database.  Any
      non-alphanumeric character in a property name will be replaced
      with '_'. e.g. "Gold.Goldscore.Constraint.Score" becomes
      "Gold_Goldscore_Constraint_Score".  This is important to know
      when querying.

    - Property names are not case sensitive in the database; this may
      cause some problems if they are case sensitive in the sd file.

      
"""
import RDConfig
import Chem
from Dbase.DbConnection import DbConnect
from Dbase import DbModule
from RDLogger import logger
from  Chem.MolDb import Loader

logger = logger()
import cPickle,sys,os

# ---- ---- ---- ----  ---- ---- ---- ----  ---- ---- ---- ----  ---- ---- ---- ----
from optparse import OptionParser
parser=OptionParser(_usage,version='%prog '+_version)
parser.add_option('--outDir','--dbDir',default='.',
                  help='name of the output directory')
parser.add_option('--molDbName',default='Compounds.sqlt',
                  help='name of the molecule database')
parser.add_option('--molIdName',default='compound_id',
                  help='name of the database key column')
parser.add_option('--regName',default='molecules',
                  help='name of the molecular registry table')
parser.add_option('--pairDbName',default='AtomPairs.sqlt',
                  help='name of the atom pairs database')
parser.add_option('--pairTableName',default='atompairs',
                  help='name of the atom pairs table')
parser.add_option('--fpDbName',default='Fingerprints.sqlt',
                  help='name of the 2D fingerprints database')
parser.add_option('--fpTableName',default='rdkitfps',
                  help='name of the 2D fingerprints table')
parser.add_option('--descrDbName',default='Descriptors.sqlt',
                  help='name of the descriptor database')
parser.add_option('--descrTableName',default='descriptors_v1',
                  help='name of the descriptor table')
parser.add_option('--descriptorCalcFilename',
                  default=os.path.join(RDConfig.RDBaseDir,'Projects','DbCLI','moe_like.dsc'),
                  help='name of the file containing the descriptor calculator')
parser.add_option('--errFilename',default='loadErrors.txt',
                  help='name of the file to contain information about molecules that fail to load')
parser.add_option('--noPairs',default=True,dest='doPairs',action='store_false',
                  help='skip calculating atom pairs')
parser.add_option('--noFingerprints',default=True,dest='doFingerprints',action='store_false',
                  help='skip calculating 2D fingerprints')
parser.add_option('--noOldFingerprints',default=True,dest='doOldFps',action='store_false',
                  help='skip calculating 2D fingerprints using the old fingerprinter')
parser.add_option('--noDescriptors',default=True,dest='doDescriptors',action='store_false',
                  help='skip calculating descriptors')
parser.add_option('--noProps',default=False,dest='skipProps',action='store_true',
                  help="don't include molecular properties in the database")
parser.add_option('--noSmiles',default=False,dest='skipSmiles',action='store_true',
                  help="don't include SMILES in the database (can make loading somewhat faster)")
parser.add_option('--maxRowsCached',default=-1,
                  help="maximum number of rows to cache before doing a database commit")

parser.add_option('--silent',default=False,action='store_true',
                  help='do not provide status messages')

parser.add_option('--molFormat',default='smiles',choices=('smiles','sdf'),
                  help='specify the format of the input file')
parser.add_option('--nameProp',default='_Name',
                  help='specify the SD property to be used for the molecule names. Default is to use the mol block name')
parser.add_option('--missingPropertyVal',default='N/A',
                  help='value to insert in the database if a property value is missing. Default is %default.')
parser.add_option('--addProps',default=False,action='store_true',
                  help='add computed properties to the output')
parser.add_option('--noExtras',default=False,action='store_true',
                  help='skip all non-molecule databases')

parser.add_option('--delimiter','--delim',default=' ',
                  help='the delimiter in the input file')
parser.add_option('--titleLine',default=False,action='store_true',
                  help='the input file contains a title line')

if __name__=='__main__':
  options,args = parser.parse_args()
  if len(args)!=1:
    parser.error('please provide a filename argument')

  if not os.path.exists(options.outDir):
    try:
      os.mkdir(options.outDir)
    except: 
      logger.error('could not create output directory %s'%options.outDir)
      sys.exit(1)
  molConn = DbConnect(os.path.join(options.outDir,options.molDbName))
  dataFilename = args[0]
  dataFile = file(dataFilename,'r')
  dataFile=None
  errFile=file(os.path.join(options.outDir,options.errFilename),'w+')
  
  if options.molFormat=='smiles':
    supplier=Chem.SmilesMolSupplier(dataFilename,
                                    titleLine=options.titleLine,
                                    delimiter=options.delimiter)
  else:
    supplier = Chem.SDMolSupplier(dataFilename)

  if options.noExtras:
    options.doPairs=False
    options.doDescriptors=False
    options.doFingerprints=False

  if not options.silent:
    logger.info('Reading molecules and constructing molecular database.')
  Loader.LoadDb(supplier,os.path.join(options.outDir,options.molDbName),
                errorsTo=errFile,regName=options.regName,nameCol=options.molIdName,
                skipProps=options.skipProps,defaultVal=options.missingPropertyVal,
                addComputedProps=options.addProps,uniqNames=True,
                skipSmiles=options.skipSmiles,maxRowsCached=int(options.maxRowsCached),
                silent=options.silent,nameProp=options.nameProp,lazySupplier=False)
  if options.doPairs:
    from Chem.AtomPairs import Pairs,Torsions
    pairConn = DbConnect(os.path.join(options.outDir,options.pairDbName))
    pairCurs = pairConn.GetCursor()
    try:
      pairCurs.execute('drop table %s'%(options.pairTableName))
    except:
      pass
    pairCurs.execute('create table %s (%s varchar not null primary key,'
                     'atompairfp blob,torsionfp blob)'%(options.pairTableName,
                                                       options.molIdName))

  if options.doFingerprints:
    fpConn = DbConnect(os.path.join(options.outDir,options.fpDbName))
    fpCurs=fpConn.GetCursor()
    try:
      fpCurs.execute('drop table %s'%(options.fpTableName))
    except:
      pass
    fpCurs.execute('create table %s (%s varchar not null primary key,'
                   'autofragmentfp blob,rdkfp blob)'%(options.fpTableName,
                                                     options.molIdName))
    from Chem.Fingerprints import FingerprintMols
    details = FingerprintMols.FingerprinterDetails()
    fpArgs = details.__dict__
  if options.doDescriptors:
    descrConn=DbConnect(os.path.join(options.outDir,options.descrDbName))
    calc = cPickle.load(file(options.descriptorCalcFilename,'rb'))
    nms = [x for x in calc.GetDescriptorNames()]
    descrCurs = descrConn.GetCursor()
    descrs = ['%s varchar not null primary key'%options.molIdName]
    descrs.extend(['%s float'%x for x in nms])
    try:
      descrCurs.execute('drop table %s'%(options.descrTableName))
    except:
      pass
    descrCurs.execute('create table %s (%s)'%(options.descrTableName,','.join(descrs)))
    descrQuery=','.join([DbModule.placeHolder]*len(descrs))
  pairRows = []
  fpRows = []
  descrRows = []

  if not options.silent: logger.info('Generating fingerprints and descriptors:')
  molCurs = molConn.GetCursor()
  if not options.skipSmiles:
    molCurs.execute('select %s,smiles,molpkl from %s'%(options.molIdName,options.regName))
  else:
    molCurs.execute('select %s,molpkl from %s'%(options.molIdName,options.regName))
  i=0
  
  fout = open('Attempts.log','w')
  
  while 1:
    tpl = molCurs.fetchone()
    if tpl is None:
      # cursor exhausted
      break
    id = tpl[0]
    pkl = tpl[-1]
    i+=1

    # Rebuild the molecule from its pickle. On failure, log the id and move on
    # so that the previous iteration's mol is never reused by mistake.
    try:
      mol = Chem.Mol(str(pkl))
    except Exception:
      fout.write('Failed: id= ' + str(id) + '\n')
      fout.flush()
      continue
    if not mol: continue

    if options.doPairs:
      try:
        pairs = Pairs.GetAtomPairFingerprintAsIntVect(mol)
        torsions = Torsions.GetTopologicalTorsionFingerprintAsIntVect(mol)
        pkl1 = DbModule.binaryHolder(pairs.ToBinary())
        pkl2 = DbModule.binaryHolder(torsions.ToBinary())
        row = [id,pkl1,pkl2]
        pairRows.append(row)
        if len(pairRows)>=500:
          pairCurs.executemany('insert into %s values (?,?,?)'%options.pairTableName,
                               pairRows)
          pairRows = []
          pairConn.Commit()
      except Exception:
        # record which molecule failed atom pair / torsion generation
        fout.write('AP failure: id= ' + str(id) + '\n')
        fout.flush()

    if options.doFingerprints:
      try:
        if options.doOldFps:
          fp = FingerprintMols.FingerprintMol(mol,**fpArgs)
          pkl1 = DbModule.binaryHolder(fp.ToBinary())
        else:
          pkl1 = ''
        fp2 = Chem.RDKFingerprint(mol)
        pkl2 = DbModule.binaryHolder(fp2.ToBinary())
        row = [id,pkl1,pkl2]
        fpRows.append(row)
        if len(fpRows)>=500:
          fpCurs.executemany('insert into %s values (?,?,?)'%options.fpTableName,
                             fpRows)
          fpRows = []
          fpConn.Commit()
      except Exception:
        # record which molecule failed fingerprint generation
        fout.write('FP failure: id= ' + str(id) + '\n')
        fout.flush()

    if options.doDescriptors:
      try:
        descrs= calc.CalcDescriptors(mol)
        row = [id]
        row.extend(descrs)
        descrRows.append(row)
        if len(descrRows)>=500:
          descrCurs.executemany('insert into %s values (%s)'%(options.descrTableName,descrQuery),
                                descrRows)
          descrRows = []
          descrConn.Commit()
      except Exception:
        # record which molecule failed descriptor calculation
        fout.write('Descr failure: id= ' + str(id) + '\n')
        fout.flush()

    if not options.silent and not i%500: 
      logger.info('  Done: %d'%(i))

  if len(pairRows):
    pairCurs.executemany('insert into %s values (?,?,?)'%options.pairTableName,
                         pairRows)
    pairRows = []
    pairConn.Commit()
  if len(fpRows):
    fpCurs.executemany('insert into %s values (?,?,?)'%options.fpTableName,
                       fpRows)
    fpRows = []
    fpConn.Commit()
  if len(descrRows):
    descrCurs.executemany('insert into %s values (%s)'%(options.descrTableName,descrQuery),
                          descrRows)
    descrRows = []
    descrConn.Commit()
  # close the failure log opened before the main loop
  fout.close()
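
The script above repeats the same "buffer rows, then executemany and commit every 500" pattern for atom pairs, fingerprints, and descriptors. A minimal sketch of how that pattern could be factored out, using the standard-library `sqlite3` module directly rather than the `DbConnect`/`DbModule` wrappers the script relies on (that substitution is an assumption made so the example is self-contained). `BatchInserter` is a hypothetical name, and the table and row values are made up for the demonstration.

```python
import sqlite3


class BatchInserter(object):
    """Buffers rows and writes them with executemany once batch_size is hit."""

    def __init__(self, conn, table, n_cols, batch_size=500):
        self.conn = conn
        # one '?' placeholder per column, e.g. "?,?,?" for three columns
        self.sql = "insert into %s values (%s)" % (table, ",".join("?" * n_cols))
        self.batch_size = batch_size
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write any buffered rows and commit; safe to call with none pending."""
        if self.rows:
            self.conn.executemany(self.sql, self.rows)
            self.conn.commit()
            self.rows = []


conn = sqlite3.connect(":memory:")
conn.execute("create table atompairs (compound_id varchar not null primary key,"
             "atompairfp blob,torsionfp blob)")

inserter = BatchInserter(conn, "atompairs", 3, batch_size=2)
for n in range(5):
    inserter.add(["cmpd-%d" % n, b"ap", b"tt"])
inserter.flush()  # write the final partial batch

count = conn.execute("select count(*) from atompairs").fetchone()[0]
print(count)
```

With a helper like this, the final flush after the loop replaces the three separate `if len(...Rows):` blocks at the end of the script.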
