Re: [Rdkit-discuss] SDF properties in case of error
Hi Michael,

What you request is certainly possible, but it is a pretty fundamental change in the way the supplier (and mol file parser) works, so it would need some thought. One concern that immediately occurs to me is that you would not be able to tell which molecules were actually empty in the input file and which are empty only because there was a problem parsing them.

A possible alternative, more general and somewhat lighter weight, would be to ensure that you can always get the text of the last item parsed from a ForwardSDMolSupplier (a method like suppl.GetLastItemText()). This would allow you to do whatever special error handling you are interested in.

-greg

On Fri, May 1, 2015 at 12:01 AM, Michael Reutlinger rd...@mulchi.de wrote:

Hi all,

I am currently working on a program which needs to process libraries of large SDF files. One requirement is to always produce a valid output that includes the molecule title/name or a specified property for referencing. Specifying sanitize=False with ForwardSDMolSupplier and calling Chem.SanitizeMol afterwards with appropriate exception handling helps in most cases to get the SD file properties and still detect errors in the molecules, so that no rubbish is imported.

However, in some cases this does not help. For example, when an unknown atom (most of the time this is X) is found in the MolBlock, the import fails with a Post-condition Violation and None is yielded. This is fine for detecting the problem, BUT it is impossible to get any information about the molecule which failed.

My question is whether there is a way to get to the data even in those cases. The files tend to be very big, so re-parsing the file line by line in Python to get the name for a specific molecule number (found by enumerating the supplier) is not really an option.

A good solution, in my opinion, would be to create an empty molecule with all SD properties, including _Name, in case of an error, instead of returning None.
The actual error could then also be communicated to Python via an '_Error' property. With this, it would still be possible to continue processing the file in a for loop, in contrast to raising an exception, and it would be easy to check whether the molecule is empty. Maybe this behaviour could be activated via an option, with the default still being to return None so as not to break any existing code.

I am very keen on getting your view on this issue.

Best regards,
Michael

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
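The lighter-weight idea from Greg's reply, keeping the raw text of the last record available so that the caller can still recover the title when parsing fails, can be sketched in plain Python. This is only an illustration under assumptions: `SDRecordReader`, `last_item_text`, and `record_title` are hypothetical names and not part of RDKit, records are simply split on the `$$$$` delimiter line, and the "parser" in the demo is a stand-in check rather than a real mol file parser.

```python
import io


class SDRecordReader:
    """Iterate over the records of an SD file as text (hypothetical helper)."""

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self.last_item_text = None  # text of the most recently yielded record

    def __iter__(self):
        lines = []
        for line in self._fileobj:
            lines.append(line)
            if line.rstrip("\r\n") == "$$$$":  # end-of-record delimiter
                self.last_item_text = "".join(lines)
                lines = []
                yield self.last_item_text
        if any(l.strip() for l in lines):  # trailing record without delimiter
            self.last_item_text = "".join(lines)
            yield self.last_item_text


def record_title(record_text):
    """The first line of a record is the molecule title."""
    return record_text.split("\n", 1)[0].rstrip("\r")


def demo():
    sdf = io.StringIO(
        "good_mol\n  prog\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
        "bad_mol\n  prog\n\nthis counts line is broken\nM  END\n$$$$\n"
    )
    reader = SDRecordReader(sdf)
    failed = []
    for record in reader:
        # Stand-in for a real parser: accept a record only if its
        # counts line (the 4th line) ends with "V2000".
        ok = record.split("\n")[3].endswith("V2000")
        if not ok:
            # The raw text is still available, so the title can be reported.
            failed.append(record_title(reader.last_item_text))
    return failed
```

With RDKit installed, the stand-in check could presumably be replaced by a call to a real parser on the mol block portion of the record, treating a None result as a failure.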
Re: [Rdkit-discuss] SDF properties in case of error
Hi Greg,

thanks for your answer. I agree that the lighter-weight solution is certainly also a possibility and would clearly solve my (and possibly others') problem. Maybe a suppl.GetLastItemError() would then also be handy to get the error messages that are usually only visible in the log.

But maybe something like an ErrorMol (as described in more detail by Andrew Dalke) could be more versatile. If an ErrorMol class inherited from Mol, it could be processed in the standard way, but one could clearly differentiate it from an empty molecule. By having different handlers, it would also be possible to add exceptions in the future, if people prefer that behaviour :-)

Either implementation would be a big improvement and could help to avoid dealing with special cases somewhere else in the workflow, leading to more robust workflows and eventually fewer errors.

Have a nice weekend,
Michael

On Sat, May 2, 2015 at 2:25 PM, Greg Landrum greg.land...@gmail.com wrote:

Hi Michael,

What you request is certainly possible, but it is a pretty fundamental change in the way the supplier (and mol file parser) works, so it would need some thought. One concern that immediately occurs to me is that you would not be able to tell which molecules were actually empty in the input file and which are empty only because there was a problem parsing them.

A possible alternative, more general and somewhat lighter weight, would be to ensure that you can always get the text of the last item parsed from a ForwardSDMolSupplier (a method like suppl.GetLastItemText()). This would allow you to do whatever special error handling you are interested in.

-greg
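The point about an ErrorMol that inherits from Mol can be illustrated with a small sketch. `Mol` here is a stand-in class defined in the example, not RDKit's `Chem.Mol`, and `ErrorMol` with its `_Error` property is hypothetical. The sketch only shows how a subclass lets ordinary code treat the object as a molecule while `isinstance` still separates a failed record from a genuinely empty one.

```python
class Mol:
    """Minimal stand-in for a molecule class (not RDKit's Chem.Mol)."""

    def __init__(self, num_atoms=0, props=None):
        self._num_atoms = num_atoms
        self._props = dict(props or {})

    def GetNumAtoms(self):
        return self._num_atoms

    def HasProp(self, key):
        return key in self._props

    def GetProp(self, key):
        return self._props[key]


class ErrorMol(Mol):
    """Hypothetical placeholder returned when a record fails to parse."""

    def __init__(self, name, error, props=None):
        merged = dict(props or {})
        merged["_Name"] = name
        merged["_Error"] = error
        super().__init__(num_atoms=0, props=merged)


def summarize(mols):
    """Classify each molecule as 'ok', 'empty', or 'error'."""
    result = []
    for mol in mols:
        if isinstance(mol, ErrorMol):
            result.append(("error", mol.GetProp("_Name"), mol.GetProp("_Error")))
        elif mol.GetNumAtoms() == 0:
            result.append(("empty", mol.GetProp("_Name"), None))
        else:
            result.append(("ok", mol.GetProp("_Name"), None))
    return result


mols = [
    Mol(num_atoms=12, props={"_Name": "mol_1"}),
    Mol(num_atoms=0, props={"_Name": "mol_2"}),  # empty in the input file
    ErrorMol("mol_3", "unknown atom symbol X"),  # failed to parse
]
summary = summarize(mols)
```

Because the `isinstance` check comes first, an error placeholder is never mistaken for an empty input molecule, which addresses the concern raised earlier in the thread.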
Re: [Rdkit-discuss] Rdkit-discuss Digest, Vol 91, Issue 1
Hi Andrew,

thanks for your helpful and detailed email. Your chemfp package is clearly also an alternative. Let's see how the discussion evolves, as I would love it if this could be part of the standard RDKit. I completely agree that it would also be nice to have a convenient way to get the error messages using something like the proposed mechanism.

Best,
Michael

On May 1, 2015, at 12:01 AM, Michael Reutlinger wrote:
> However, in some cases this does not help. E.g. when an unknown atom (most of the time this is X) is found in the MolBlock the import fails with an Post-condition Violation and None is yielded. This is fine to detect the problem BUT it is impossible to get any information about the molecule which failed.

As a backup solution, outside of RDKit, you might try my chemfp package, available from https://chem-fingerprints.googlecode.com/files/chemfp-1.1.tar.gz (Hmm, looks like I need to migrate that away from Google Code.)

One of the internal functions [*] has a way to read individual SDF records as text:

    >>> from chemfp import sdf_reader
    >>> for record in sdf_reader.open_sdf("tests/pubchem.sdf"):
    ...     print record.split("\n", 1)[0]
    ...
    9425004
    9425009
    9425012
    9425015
    9425018
    9425021

If you use the bit of code in this email after my signature you can extract the tag/data pairs from the record:

    >>> from chemfp import sdf_reader
    >>> for record in sdf_reader.open_sdf("tests/pubchem.sdf"):
    ...     id = record.partition("\n")[0]
    ...     tags = dict(get_sdf_tag_pairs(record))
    ...     print id, tags["PUBCHEM_OPENEYE_ISO_SMILES"]
    ...
    9425004 CC1=CC(=NN1CC(=O)NNC(=O)\C=C\C2=C(C=CC=C2Cl)F)C
    9425009 CC1=CC(=NN1CC(=O)NNC(=O)CCC2=NC(=NO2)C3=CC=CC=C3)C
    9425012 CCC1=NOC(=C1C(=O)NNC(=O)CN2C(=CC(=N2)C)C)C
    9425015 CC1=CC(=NN1CC(=O)NNC(=O)CCC(=O)C2=CC=C(C=C2)C3=CC=CC=C3)C
    9425018 CC1=CC(=NN1CC(=O)NNC(=O)C2=CC=CC=C2SCC(=O)N(C)C)C
    ...

I also included a function called MolFromSDBlock which is like MolFromMolBlock except that it also copies over the tag data as properties.
In that way you can get what you want from RDKit like this:

    >>> for record in sdf_reader.open_sdf("/Users/dalke/databases/chembl_14.sdf"):
    ...     mol = MolFromSDBlock(record)
    ...     if mol is None:
    ...         print "Could not process", dict(get_sdf_tag_pairs(record))["chembl_id"]
    ...     else:
    ...         print mol.GetProp("chembl_id"), mol.GetNumAtoms()
    ...
    CHEMBL438581 165
    CHEMBL155459 44
    CHEMBL154288 52
    CHEMBL443179 56
    CHEMBL443183 92
    CHEMBL443332 18
    ..
    CHEMBL265763 40
    [01:03:52] Explicit valence for atom # 0 B, 5, is greater than permitted
    Could not process CHEMBL268118
    CHEMBL265830 29
    ...

I've also sketched out a solution which returns an empty molecule with the _Name, _Error, and properties set from the SD tags. There's only one line to comment out to get it, but I've not actually tested that code path.

Be aware that I wasn't quite as experienced in how to parse SD files when I wrote the code for chemfp-1.1 some 5 years ago. For example, you shouldn't have tag data with a line starting with a '>'.

[*] By "internal" I mean that it's not documented and not part of the stable API. In fact, it has changed in more recent versions of chemfp, where similar functionality is now part of the stable API. However, those more recent versions, while still free/open source software, are a commercial product and cost money. Contact me if you are interested in purchasing a copy. :)

> My question is if there is a way to get to the data even for those cases? The files tend to be very big so accessing the molecule re-parsing it line-by-line in python to get the name for a specific molecule number (found by enumerating the supplier) is not really an option.

My timing numbers for chemfp-1.1 showed about the same performance as RDKit's own parser. In newer versions I fixed some of the corner cases, and rewrote the code in C for better performance.

> What would be a good solution in my opinion is to create an empty molecule with all sd properties, including _Name, in case of an error instead of None.
> The actual error could then also be communicated into python via an '_Error' property. ... Maybe this behaviour could be activated via an option and the default would be to return None, to not break any existing code.

It would have to be via an option, for exactly the reason you highlighted. The option might look like:

    ForwardSDMolSupplier(..., onError=handler)

The simplest is if handler is one of a handful of possible string values:

  - "None" to return None on failure (the current behavior)
  - "ErrorMol" to return an error molecule like you describe

Personally, I would love some easy way from the Python API to get access to the warning and error messages without having to intercept the log messages. I think that something like this is the way to get there.
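The tag extraction and the `onError` option discussed above could look roughly like the following. This is a sketch, not Andrew's actual code (which was attached to his email and is not reproduced in this thread) and not an RDKit API: `get_tag_pairs`, `iter_mols`, `on_error`, and the dictionary used as an error placeholder are all assumptions made for illustration, and `parse` is any callable that returns None on failure.

```python
import re

# A data header line looks like ">  <TAG_NAME>" (optionally with extra fields).
_TAG_HEADER = re.compile(r"^>.*<([^>]+)>")


def get_tag_pairs(record_text):
    """Yield (tag, value) pairs from the data items of one SD record."""
    lines = record_text.splitlines()
    i = 0
    while i < len(lines):
        match = _TAG_HEADER.match(lines[i])
        i += 1
        if match is None:
            continue
        value_lines = []
        # The value runs until the first blank line or the record delimiter.
        while i < len(lines) and lines[i].strip() and lines[i] != "$$$$":
            value_lines.append(lines[i])
            i += 1
        yield match.group(1), "\n".join(value_lines)


def iter_mols(records, parse, on_error="None"):
    """Parse each record; on failure apply the selected error policy."""
    if on_error not in ("None", "ErrorMol"):
        raise ValueError("on_error must be 'None' or 'ErrorMol'")
    for record in records:
        mol = parse(record)
        if mol is None and on_error == "ErrorMol":
            # Placeholder carrying the title, an error note, and all tags.
            mol = {"_Name": record.split("\n", 1)[0], "_Error": "parse failed"}
            mol.update(get_tag_pairs(record))
        yield mol


record = "bad_mol\n  prog\n\nbroken\nM  END\n>  <chembl_id>\nCHEMBL268118\n\n$$$$\n"
default_result = list(iter_mols([record], parse=lambda text: None))
error_result = list(iter_mols([record], parse=lambda text: None, on_error="ErrorMol"))
```

Keeping "None" as the default mirrors the requirement in the thread that existing code, which already checks for None, must not break.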
Re: [Rdkit-discuss] Help building the RDKit cookbook
On Thu, Apr 30, 2015 at 10:40 AM, JP jeanpaul.ebe...@inhibox.com wrote:
> The build for 'make singlehtml' works (so does 'make html' and removing the dep). But the generated documentation has dead links.

That's expected.

> I think these APIs are needed because the last (bottom) section, 'Additional Information', has links to the Python and C++ APIs, which of course are not present (and the link is dead). (An ugly hack could be to link to the online versions of these?) Where are these two API docs stored on the file system?

They aren't stored anywhere by default; they have to be generated using epydoc (for the Python docs) and doxygen (for the C++ docs). You don't need to worry about this while updating the other documentation though.

-greg