Re: [Rdkit-discuss] SDF properties in case of error

2015-05-02 Thread Greg Landrum
Hi Michael,

What you request is certainly possible, but it is a pretty fundamental
change in the way the supplier (and mol file parser) works, so it would
need some thought.

One concern that immediately occurs to me is that you will not be able to
tell which molecules from the input file were actually empty in the input
and which were just empty because there was a problem parsing an input
molecule.

A possible alternative, more general and somewhat lighter weight, would be
to ensure that you can always get the text of the last item parsed from a
ForwardSDMolSupplier (a method like: suppl.GetLastItemText()); this would
allow you to do whatever special error handling you are interested in doing.

-greg
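
A rough sketch of what access to the raw record text makes possible. Since suppl.GetLastItemText() is only a proposal in this thread, the example splits the input into records itself in pure Python. The names iter_sd_records, record_title and parse_with_fallback are hypothetical helpers, not RDKit API, and parse stands for whichever parser is in use.

```python
def iter_sd_records(lines):
    """Yield each SD record as one string, split on the '$$$$' delimiter line."""
    buf = []
    for line in lines:
        if line.rstrip("\r\n") == "$$$$":
            yield "".join(buf)
            buf = []
        else:
            buf.append(line)
    if any(l.strip() for l in buf):
        yield "".join(buf)  # trailing record without a closing '$$$$'


def record_title(record):
    """The molecule title is the first line of the mol block."""
    return record.split("\n", 1)[0].strip()


def parse_with_fallback(lines, parse):
    """Yield (title, mol, raw_text); mol is None when parse() fails.

    `parse` is any callable taking the record text (an assumption of this
    sketch); both an exception and a None return count as failure.
    """
    for text in iter_sd_records(lines):
        try:
            mol = parse(text)
        except Exception:
            mol = None
        yield record_title(text), mol, text
```

Because the raw text travels with every result, the title of a failed record is still available for reporting. In practice parse might be something like RDKit's Chem.MolFromMolBlock, which returns None on failure; that pairing is an assumption and is not exercised here.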


On Fri, May 1, 2015 at 12:01 AM, Michael Reutlinger rd...@mulchi.de wrote:

 Hi all,

 I am currently working on a program which needs to process libraries of
 large SDF files. One requirement is to always produce a valid output
 including the molecule title/name or a specified property for referencing.

 Specifying sanitize=False with ForwardSDMolSupplier and calling
 Chem.SanitizeMol afterwards with appropriate exception handling helps in
 most cases to get the SD file properties and still detect errors in the
 molecules to avoid importing rubbish.

 However, in some cases this does not help. E.g. when an unknown atom (most
 of the time this is X) is found in the MolBlock the import fails with a
 Post-condition Violation and None is yielded. This is fine to detect the
 problem BUT it is impossible to get any information about the molecule
 which failed.

 My question is if there is a way to get to the data even for those cases?
 The files tend to be very big so accessing the molecule by re-parsing it
 line-by-line in python to get the name for a specific molecule number
 (found by enumerating the supplier) is not really an option.

 What would be a good solution in my opinion is to create an empty molecule
 with all sd properties, including _Name, in case of an error instead of
 None. The actual error could then also be communicated into python via an
 '_Error' property. With this it would still be possible to continue
 processing of the file in a for loop, in contrast to raising an Exception,
 and it is easy to check if the molecule is empty.
 Maybe this behaviour could be activated via an option and the default
 would be to return None, to not break any existing code.

 I am very keen on getting your view on this issue.

 Best regards,
 Michael


 --
 One dashboard for servers and applications across Physical-Virtual-Cloud
 Widest out-of-the-box monitoring support with 50+ applications
 Performance metrics, stats and reports that give you Actionable Insights
 Deep dive visibility with transaction tracing using APM Insight.
 http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss




Re: [Rdkit-discuss] SDF properties in case of error

2015-05-02 Thread Michael Reutlinger
Hi Greg,

thanks for your answer. I agree that the lighter-weight solution is
certainly also a possibility and would clearly solve my (and possibly
others') problem. Maybe a suppl.GetLastItemError() would then also be handy
to get the error messages that usually are only visible in the log.

But maybe something like an ErrorMol (as described in more detail by Andrew
Dalke) could potentially be more versatile. If an ErrorMol class is
inherited from Mol it could be processed in a standard way but one could
clearly differentiate this vehicle from an empty molecule. By having
different handlers, it would also be possible to add Exceptions in the
future, if people prefer having this behaviour :-)
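
To make the inheritance idea concrete, here is a minimal sketch. The Mol class below is only a stand-in so the example runs without RDKit, and ErrorMol, its error attribute and describe() are hypothetical names for illustration; nothing here is existing RDKit API.

```python
class Mol:
    """Stand-in for rdkit.Chem.Mol so the sketch runs without RDKit."""

    def __init__(self, props=None):
        self._props = dict(props or {})

    def GetProp(self, key):
        return self._props[key]

    def GetNumAtoms(self):
        return 0  # the stand-in never holds atoms


class ErrorMol(Mol):
    """Hypothetical: an atom-less molecule that records why parsing failed."""

    def __init__(self, props, error):
        super().__init__(props)
        self.error = error


def describe(mol):
    """Tell a failed parse apart from a genuinely empty input molecule."""
    if isinstance(mol, ErrorMol):
        return "failed: %s (%s)" % (mol.GetProp("_Name"), mol.error)
    if mol.GetNumAtoms() == 0:
        return "empty: %s" % mol.GetProp("_Name")
    return "ok: %s" % mol.GetProp("_Name")
```

Code that only needs the properties can treat both objects alike, while an isinstance check separates a parse failure from an empty molecule, which addresses the ambiguity Greg raised.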

However, both implementations would be a big improvement and could help to
avoid dealing with special cases somewhere else in the workflow, leading to
more robust workflows and eventually fewer errors.

Have a nice weekend,
Michael






Re: [Rdkit-discuss] Rdkit-discuss Digest, Vol 91, Issue 1

2015-05-02 Thread Michael Reutlinger
Hi Andrew,

thanks for your helpful and detailed email. Your chemfp package is clearly
also an alternative to use. Let's see how the discussion evolves as I would
love if this could be part of the standard RDKit.

I completely agree that it would also be nice to have a convenient way to
get the error messages using something like the proposed mechanism.

Best,
Michael


 On May 1, 2015, at 12:01 AM, Michael Reutlinger wrote:
  However, in some cases this does not help. E.g. when an unknown atom
  (most of the time this is X) is found in the MolBlock the import fails with
  a Post-condition Violation and None is yielded. This is fine to detect the
  problem BUT it is impossible to get any information about the molecule
  which failed.

 As a backup solution, outside of RDKit, you might try my chemfp package,
 available from
 https://chem-fingerprints.googlecode.com/files/chemfp-1.1.tar.gz
 (Hmm, looks like I need to migrate that away from Google Code.)

 One of the internal functions [*] has a way to read individual SDF records
 as text:

    >>> from chemfp import sdf_reader
    >>> for record in sdf_reader.open_sdf("tests/pubchem.sdf"):
    ...   print record.split("\n", 1)[0]
    ...
    9425004
    9425009
    9425012
    9425015
    9425018
    9425021

 If you use the bit of code in this email after my signature you can
 extract the tag/data pair from the record:

    >>> from chemfp import sdf_reader
    >>> for record in sdf_reader.open_sdf("tests/pubchem.sdf"):
    ...   id = record.partition("\n")[0]
    ...   tags = dict(get_sdf_tag_pairs(record))
    ...   print id, tags["PUBCHEM_OPENEYE_ISO_SMILES"]
    ...
    9425004 CC1=CC(=NN1CC(=O)NNC(=O)\C=C\C2=C(C=CC=C2Cl)F)C
    9425009 CC1=CC(=NN1CC(=O)NNC(=O)CCC2=NC(=NO2)C3=CC=CC=C3)C
    9425012 CCC1=NOC(=C1C(=O)NNC(=O)CN2C(=CC(=N2)C)C)C
    9425015 CC1=CC(=NN1CC(=O)NNC(=O)CCC(=O)C2=CC=C(C=C2)C3=CC=CC=C3)C
    9425018 CC1=CC(=NN1CC(=O)NNC(=O)C2=CC=CC=C2SCC(=O)N(C)C)C
      ...

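The helper referred to above is not reproduced in this archive. As a stand-in, the following is a minimal sketch of how tag/data pairs could be pulled out of a record's text, assuming the usual SD layout (a header line starting with '>' that carries the tag in angle brackets, one or more data lines, then a blank line). The name sketch_tag_pairs is hypothetical and this is not the chemfp implementation.

```python
import re

# Header line of a data item, e.g. "> <PUBCHEM_OPENEYE_ISO_SMILES>"
_TAG_HEADER = re.compile(r"^>.*<([^<>]+)>")


def sketch_tag_pairs(record):
    """Yield (tag, value) pairs from the data items that follow 'M  END'."""
    lines = record.splitlines()
    # Skip the mol block so atom or bond lines are never mistaken for tags.
    start = lines.index("M  END") + 1 if "M  END" in lines else 0
    tag, value = None, []
    for line in lines[start:]:
        if tag is None:
            match = _TAG_HEADER.match(line)
            if match:
                tag, value = match.group(1), []
        elif line.strip() == "":
            # A blank line terminates the current data item.
            yield tag, "\n".join(value)
            tag = None
        else:
            value.append(line)
    if tag is not None:
        yield tag, "\n".join(value)  # item not closed by a blank line
```

Wrapping the result in dict(...) gives the same lookup style as in the session above. The sketch does not handle every corner case of the format.
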
 I also included a function called MolFromSDBlock which is like
 MolFromMolBlock except that it also copies over the tag data as
 properties. In that way you can get what you want from RDKit like this:

    >>> for record in sdf_reader.open_sdf("/Users/dalke/databases/chembl_14.sdf"):
    ...   mol = MolFromSDBlock(record)
    ...   if mol is None:
    ...     print "Could not process", dict(get_sdf_tag_pairs(record))["chembl_id"]
    ...   else:
    ...     print mol.GetProp("chembl_id"), mol.GetNumAtoms()
    ...
    CHEMBL438581 165
    CHEMBL155459 44
    CHEMBL154288 52
    CHEMBL443179 56
    CHEMBL443183 92
    CHEMBL443332 18
      ..
    CHEMBL265763 40
    [01:03:52] Explicit valence for atom # 0 B, 5, is greater than permitted
    Could not process CHEMBL268118
    CHEMBL265830 29
      ...

 I've also sketched out a solution which returns an empty molecule with the
 _Name, _Error, and properties set from the SD tag. There's only one
 line to comment out to get it, but I've not actually tested that code path.

 Be aware that I wasn't quite as experienced in how to parse SD files when
 I wrote code for chemfp-1.1 some 5 years ago. For example, you shouldn't
 have tag data with a line starting with a '>'.

 [*] By internal I mean that it's not documented and not part of the
 stable API. In fact, it has changed in more recent versions of chemfp,
 where similar functionality is now part of the stable API. However, those
 more recent versions, while still free/open source software, are a
 commercial product and cost money.
 Contact me if you are interested in purchasing a copy. :)

  My question is if there is a way to get to the data even for those
  cases? The files tend to be very big so accessing the molecule by re-parsing
  it line-by-line in python to get the name for a specific molecule number
  (found by enumerating the supplier) is not really an option.

 My timing numbers for chemfp-1.1 had about the same performance as RDKit's
 own parser. In newer versions I fixed some of the corner cases, and rewrote
 the code in C for better performance.

  What would be a good solution in my opinion is to create an empty
  molecule with all sd properties, including _Name, in case of an error
  instead of None. The actual error could then also be communicated into
  python via an '_Error' property.
    ...
  Maybe this behaviour could be activated via an option and the default
  would be to return None, to not break any existing code.

 It would have to be via an option, for exactly the reason you highlighted.
 The option might look like:

    ForwardSDMolSupplier(, onError=handler)

 The simplest is if handler is one of a handful of possible string values:
   - "None" to return None on failure; the current behavior
   - "ErrorMol" to return an error molecule like you describe

 Personally, I would love some easy way from the Python API to get access
 to the warning and error messages without having to intercept the log
 messages. I think that something like this is the way to get there.
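
A small sketch of how such an option could dispatch, written as a plain generator so it runs without RDKit. Everything in it is an assumption for illustration: supply(), the on_error parameter, and the plain dict used as a placeholder for an error molecule are hypothetical, and parse is assumed to raise on failure (RDKit's suppliers return None and log instead).

```python
def _return_none(text, exc):
    """Current behaviour: a failed record yields None."""
    return None


def _return_error_mol(text, exc):
    """Placeholder 'error molecule': just the title and the error message."""
    return {"_Name": text.split("\n", 1)[0].strip(), "_Error": str(exc)}


_HANDLERS = {"None": _return_none, "ErrorMol": _return_error_mol}


def supply(records, parse, on_error="None"):
    """Yield parse(record) per record, routing failures through on_error.

    on_error is either one of the string names above or a callable taking
    (record_text, exception) and returning the object to yield instead.
    """
    handler = on_error if callable(on_error) else _HANDLERS[on_error]
    for text in records:
        try:
            mol = parse(text)
        except Exception as exc:
            mol = handler(text, exc)
        yield mol
```

Defaulting on_error to "None" keeps existing loops working unchanged, while accepting a callable leaves room for handlers that raise instead, as suggested elsewhere in the thread.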

Re: [Rdkit-discuss] Help building the RDKit cookbook

2015-05-02 Thread Greg Landrum
On Thu, Apr 30, 2015 at 10:40 AM, JP jeanpaul.ebe...@inhibox.com wrote:

 The build for 'make singlehtml' works (so does 'make html' and removing
 the dep).  But the generated documentation has dead links.


That's expected.


 I think these APIs are needed because the last (bottom) section,
 'Additional Information', has links to the Python and C++ APIs, which of
 course are not present (and the link is dead).  (An ugly hack could be to
 link to the online versions of these?)  Where are these two API docs
 stored on the file system?


They aren't stored anywhere by default; they have to be generated using
epydoc (for the python docs) and doxygen (for the C++ docs). You don't need
to worry about this while updating the other documentation though.

-greg