Re: [Rdkit-discuss] RDKit - DbCLI

2008-11-24 Thread Greg Landrum
Dear Kirk,

On Fri, Nov 21, 2008 at 12:38 AM, Robert DeLisle rkdeli...@gmail.com wrote:

 After running through the process with exception handling in place I was
 able to isolate 10 structures that were being problematic.  All of them had
 at least one bond designated as 0 order in the SD file - much as you found
 for some of the other structures previously.  I assume that these passed the
 initial import step but are failing upon descriptor generation for obvious
 reasons.

 I suppose the only request that I have is for more graceful error handling.
 I've attached my (admittedly sloppy) version of CreateDB.py showing what I
 did to isolate the errors.

The problem here was in the mol file parser: it was not correctly
setting up bonds that have order 0. Now it generates a warning (order
0 isn't technically allowed by the ctab spec) and sets the bond up
correctly. I also added some error checking to handle other bogus bond
orders.

This was entered as issue 2337369
(https://sourceforge.net/tracker2/?func=detail&aid=2337369&group_id=160139&atid=814650)
and fixed in rev892.
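
For anyone who wants to find such records before loading them, here is a minimal sketch of a pre-scan for order-0 bonds. It is not the RDKit parser and not the fix mentioned above; it only assumes a fixed-width V2000 connection table (three header lines, a counts line, the atom block, then the bond block). The names find_zero_order_bonds and EXAMPLE are hypothetical and exist only for illustration.

```python
def find_zero_order_bonds(molblock):
    """Return (atom1, atom2) pairs for V2000 bonds whose bond-type field is 0."""
    lines = molblock.splitlines()
    counts = lines[3]                  # counts line follows the three header lines
    n_atoms = int(counts[0:3])         # fixed-width field: number of atoms
    n_bonds = int(counts[3:6])         # fixed-width field: number of bonds
    first_bond = 4 + n_atoms           # bond block starts right after the atom block
    zero_order = []
    for line in lines[first_bond:first_bond + n_bonds]:
        atom1, atom2, order = int(line[0:3]), int(line[3:6]), int(line[6:9])
        if order == 0:
            zero_order.append((atom1, atom2))
    return zero_order


# Hypothetical two-atom record whose single bond has order 0.
EXAMPLE = "\n".join([
    "example",
    "  hypothetical",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  0  0",
    "M  END",
])

print(find_zero_order_bonds(EXAMPLE))
```

A scan like this only flags the records; whether to drop or repair them is a separate decision.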

-greg



Re: [Rdkit-discuss] RDKit - DbCLI

2008-11-24 Thread Robert DeLisle
Fantastic!  Thanks, Greg!

After I got things working I've been able to generate a database and do some
preliminary searches.  I'm impressed at how quickly I can search ~100,000
compounds with SMARTS patterns.  I have a feeling this one is going to get a
lot of use.

-Kirk






Re: [Rdkit-discuss] RDKit - DbCLI

2008-11-24 Thread Greg Landrum
On Mon, Nov 24, 2008 at 5:11 PM, Robert DeLisle rkdeli...@gmail.com wrote:
 Fantastic!  Thanks, Greg!

 After I got things working I've been able to generate a database and do some
 preliminary searches.  I'm impressed at how quickly I can search ~100,000
 compounds with SMARTS patterns.  I have a feeling this one is going to get a
 lot of use.

Glad to hear it looks useful. The search speed isn't terrible as it
is, but it could be made a ton faster by using a substructure
fingerprint. But that's for a later version.
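
To illustrate the general idea of fingerprint screening (this is a toy sketch, not RDKit code and not how DbCLI is or will be implemented): each molecule gets a bit vector, and a molecule can only match a query if every bit set in the query fingerprint is also set in the molecule fingerprint. Candidates failing that cheap test skip the expensive match. Here the "molecules" are plain sets of feature strings, and toy_fingerprint, might_contain, search, and slow_match are all hypothetical names.

```python
import zlib

N_BITS = 64  # toy fingerprint size; real fingerprints are much longer


def toy_fingerprint(features):
    """Hash each feature string onto one bit of an integer bit vector."""
    fp = 0
    for feat in features:
        fp |= 1 << (zlib.crc32(feat.encode("utf-8")) % N_BITS)
    return fp


def might_contain(mol_fp, query_fp):
    """Cheap screen: False means the molecule cannot match the query."""
    return (mol_fp & query_fp) == query_fp


def slow_match(mol_features, query_features):
    """Stand-in for the expensive substructure match."""
    return query_features <= mol_features


def search(database, query_features):
    """Return matching names and how many expensive matches were run."""
    query_fp = toy_fingerprint(query_features)
    hits, n_slow = [], 0
    for name, features, fp in database:
        if not might_contain(fp, query_fp):
            continue               # screened out without the expensive match
        n_slow += 1
        if slow_match(features, query_features):
            hits.append(name)
    return hits, n_slow


raw = [
    ("ethanol", {"C", "O", "C-C", "C-O"}),
    ("benzene", {"c", "c:c"}),
    ("phenol", {"c", "c:c", "O", "c-O"}),
]
# Fingerprints are computed once, when the database is built.
database = [(name, feats, toy_fingerprint(feats)) for name, feats in raw]

hits, n_slow = search(database, {"c", "O"})
print(hits, n_slow)
```

The screen can let false positives through (hash collisions), but it never rejects a true match, so the final hit list is the same as without screening.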

-greg



Re: [Rdkit-discuss] RDKit - DbCLI

2008-11-20 Thread Robert DeLisle
Greg,

After running through the process with exception handling in place I was
able to isolate 10 structures that were being problematic.  All of them had
at least one bond designated as 0 order in the SD file - much as you found
for some of the other structures previously.  I assume that these passed the
initial import step but are failing upon descriptor generation for obvious
reasons.

I suppose the only request that I have is for more graceful error handling.
I've attached my (admittedly sloppy) version of CreateDB.py showing what I
did to isolate the errors.
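
For readers without the attachment, the general pattern looks roughly like the sketch below. It is not the attached CreateDB.py; process_records and toy_descriptor are hypothetical names, and toy_descriptor merely stands in for the fingerprint/descriptor step. The idea is to catch the exception per record, log it, and keep going, so one bad structure does not stop the whole load.

```python
import logging

logger = logging.getLogger("createdb_sketch")


def process_records(records, compute):
    """Apply compute to each (record_id, payload); collect failures instead of stopping."""
    results, failures = {}, []
    for record_id, payload in records:
        try:
            results[record_id] = compute(payload)
        except Exception as exc:
            # Broad catch on purpose: the error seen in this thread was a bare RuntimeError.
            logger.warning("skipping %s: %s", record_id, exc)
            failures.append((record_id, repr(exc)))
    return results, failures


def toy_descriptor(payload):
    """Stand-in for descriptor generation; fails on a marker value."""
    if payload == "bad":
        raise RuntimeError("Unknown exception")
    return len(payload)


records = [("m1", "CCO"), ("m2", "bad"), ("m3", "c1ccccc1")]
results, failures = process_records(records, toy_descriptor)
print(sorted(results), [record_id for record_id, _ in failures])
```

The failures list can then be written out and the offending structures inspected separately.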

-Kirk





On Thu, Nov 20, 2008 at 1:33 PM, rkdeli...@gmail.com wrote:

 Indeed I can. Luckily I had a console window open with the error in place
 just as I saw your message:



 [13:21:16] INFO: Done: 54500
 Traceback (most recent call last):
   File "C:\RDKit_Q32008_1\Projects\dbcli\CreateDB.py", line 222, in <module>
     mol = Chem.Mol(str(pkl))
 RuntimeError: Unknown exception


 I've just wrapped this one in a try-catch block as well.




 On Nov 20, 2008 1:17pm, Greg Landrum greg.land...@gmail.com wrote:
  Can you send me the console output without disclosing things you
  oughtn't to disclose?

  FYI: the deprecation warnings ought not to be causing the problem.
  There ought to be a bug report filed against this already, but it
  looks like I forgot to submit it. grn.

  -greg

  On Thu, Nov 20, 2008 at 9:06 PM,   wrote:
   Greg,

   Thanks for the quick response.

   In reading my original question I realize I didn't explain myself well.
   Sorry about that. 8^)

   I'm trying to set up a database of ~100,000 structures which will be queried
   by very few structures at a time. While running CreateDB.py I get to the
   step that gives an output of:

   'Generating fingerprints and descriptors:'

   In reading the output more closely I see that there are some deprecation
   warnings that mention a distance matrix - that's where my original question
   regarding a pairwise computation step came from. Regardless, after around
   50,000 structures, I get a 'Runtime: unexpected exception' message and
   Python stops. Having done a bit more research I see that each molecule is
   passed through Atom Pair, Fingerprint, and Descriptor generation. I assume
   it is failing somewhere within those steps, but I haven't yet identified
   where or why. I have just wrapped all of those procedures in try-catch
   blocks in hopes of finding the offending structure. Once I have it, I'll do
   some tests on it and send it your way.

   -Kirk

   On Nov 20, 2008 12:41pm, Greg Landrum wrote:
    [moving a general-interest question to the mailing list]

    Hi Kirk,

    On Thu, Nov 20, 2008 at 6:03 PM,   wrote:
     I have another question on DbCLI. After getting rid of problematic
     structures, I was able to get DbCLI to the pairwise comparison step, but my

    I'm not sure what the pairwise comparison step is with the DbCLI stuff.
    Step one is loading the database with CreateDb.py, step 2 is doing
    searches with SearchDb.py. What are you asking about?

     dataset has on the order of 100,000 structures. After about 50,000
     structures Python issued an Unexpected error response and stopped. Is this
     likely due to the enormous size of a pairwise distance table for this
     dataset? Have you had problems with very large datasets in the past or has
     this typically worked smoothly?

    I must admit that I've never queried with that number of structures.
    My typical use case is to have a large database (10^5-10^6 compounds)
    and query that with a few (~10) structures. The code hasn't really
    been written to deal with giant query sets. That is doable, but it
    would require some reworking. Probably the best bet would be to
    support loading the queries from a database as well; that way you
    wouldn't have to reprocess the queries every time and could pretty
    easily handle the "only loading a few at a time" problem.

    It's an interesting thing to think about.

    -greg

# $Id: CreateDb.py 665 2008-05-15 04:33:40Z glandrum $
#
#  Copyright (c) 2007, Novartis Institutes for BioMedical Research Inc.
#  All rights reserved.
# 
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met: 
#
# * Redistributions of source code must retain the above copyright 
#   notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above
#   copyright notice, this list of conditions and the following 
#   disclaimer in the documentation and/or