Re: [Rdkit-discuss] Rdkit-discuss Digest, Vol 91, Issue 1

Michael Reutlinger Sat, 02 May 2015 14:54:03 -0700

Hi Andrew,

thanks for your helpful and detailed email. Your chemfp package is clearly
another option. Let's see how the discussion evolves, as I would
love it if this could be part of the standard RDKit.


I completely agree that it would also be nice to have a convenient way to
get the error messages using something like the proposed mechanism.

Best,
Michael


> On May 1, 2015, at 12:01 AM, Michael Reutlinger wrote:
> > However, in some cases this does not help. E.g. when an unknown atom
> > (most of the time this is X) is found in the MolBlock the import fails
> > with a Post-condition Violation and None is yielded. This is fine to
> > detect the problem BUT it is impossible to get any information about
> > the molecule which failed.
> As a backup solution, outside of RDKit, you might try my chemfp package,
> available from
> https://chem-fingerprints.googlecode.com/files/chemfp-1.1.tar.gz
> (Hmm, looks like I need to migrate that away from Google Code.)
> One of the internal functions [*] has a way to read individual SDF records
> as text:
> >>> from chemfp import sdf_reader
> >>> for record in sdf_reader.open_sdf("tests/pubchem.sdf"):
> ...   print record.split("\n", 1)[0]
> ...
> 9425004
> 9425009
> 9425012
> 9425015
> 9425018
> 9425021
> If you use the bit of code in this email after my signature you can
> extract the tag/data pair from the record:
> >>> from chemfp import sdf_reader
> >>> for record in sdf_reader.open_sdf("tests/pubchem.sdf"):
> ...   id = record.partition("\n")[0]
> ...   tags = dict(get_sdf_tag_pairs(record))
> ...   print id, tags["PUBCHEM_OPENEYE_ISO_SMILES"]
> ...
> 9425004 CC1=CC(=NN1CC(=O)NNC(=O)\C=C\C2=C(C=CC=C2Cl)F)C
> 9425009 CC1=CC(=NN1CC(=O)NNC(=O)CCC2=NC(=NO2)C3=CC=CC=C3)C
> 9425012 CCC1=NOC(=C1C(=O)NNC(=O)CN2C(=CC(=N2)C)C)C
> 9425015 CC1=CC(=NN1CC(=O)NNC(=O)CCC(=O)C2=CC=C(C=C2)C3=CC=CC=C3)C
> 9425018 CC1=CC(=NN1CC(=O)NNC(=O)C2=CC=CC=C2SCC(=O)N(C)C)C
>   ...
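
The helper code referred to above is not reproduced in this message. As a rough sketch of what extracting tag/data pairs from a record involves (not Andrew's actual get_sdf_tag_pairs; the name sketch_tag_pairs and the sample record are made up), the following assumes each data item has a header line starting with ">" that carries the tag name in angle brackets, followed by data lines and a terminating blank line:

```python
import re

# Header lines look like "> <TAG>" or ">  <TAG> (1)"; capture the tag name.
_TAG_HEADER = re.compile(r"^>.*?<([^>]+)>")


def sketch_tag_pairs(record):
    """Yield (tag, value) pairs from the data block of one SD record.

    Hypothetical helper for illustration only.
    """
    lines = record.splitlines()
    # The data block starts after the "M  END" line of the molfile part.
    start = 0
    for i, line in enumerate(lines):
        if line.rstrip() == "M  END":
            start = i + 1
            break
    tag, data = None, []
    for line in lines[start:]:
        if line == "$$$$":
            break
        # Only look for a header when not already inside a data item.
        match = _TAG_HEADER.match(line) if tag is None else None
        if match:
            tag, data = match.group(1), []
        elif tag is not None:
            if line.strip() == "":
                # A blank line closes the current data item.
                yield tag, "\n".join(data)
                tag = None
            else:
                data.append(line)
    if tag is not None:
        yield tag, "\n".join(data)


# Made-up record, not taken from PubChem.
record = (
    "example\n  sketch\n\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
    "> <ID>\n1\n\n"
    "> <SMILES>\nCCO\n\n"
    "$$$$\n"
)
tags = dict(sketch_tag_pairs(record))
print(tags["ID"], tags["SMILES"])
```

One simplification to be aware of: a data item that is not followed by a blank line will swallow the next header as data.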
> I also included a function called "MolFromSDBlock" which is like
> "MolFromMolBlock" except that it also copies over the tag data as
> properties. In that way you can get what you want from RDKit like this:
>
> >>> for record in sdf_reader.open_sdf("/Users/dalke/databases/chembl_14.sdf"):
> ...   mol = MolFromSDBlock(record)
> ...   if mol is None:
> ...     print "Could not process", dict(get_sdf_tag_pairs(record))["chembl_id"]
> ...   else:
> ...     print mol.GetProp("chembl_id"), mol.GetNumAtoms()
> ...
> CHEMBL438581 165
> CHEMBL155459 44
> CHEMBL154288 52
> CHEMBL443179 56
> CHEMBL443183 92
> CHEMBL443332 18
>   ..
> CHEMBL265763 40
> [01:03:52] Explicit valence for atom # 0 B, 5, is greater than permitted
> Could not process CHEMBL268118
> CHEMBL265830 29
>   ...
> I've also sketched out a solution which returns an empty molecule with the
> "_Name", "_Error", and properties set from the SD tag. There's only one
> line to comment out to get it, but I've not actually tested that code path.
>
> Be aware that I wasn't quite as experienced in how to parse SD files when
> I wrote the code for chemfp-1.1 some 5 years ago. For example, it may not
> cope with tag data that has a line starting with a '>'.
>
> [*] By "internal" I mean that it's not documented and not part of the
> stable API. In fact, it has changed in more recent versions of chemfp,
> where similar functionality is now part of the stable API. However, those
> more recent versions, while still free/open source software, are a
> commercial product and cost money.
> Contact me if you are interested in purchasing a copy. :)
> > My question is if there is a way to get to the data even for those
> > cases? The files tend to be very big, so re-parsing the file
> > line-by-line in Python to get the name for a specific molecule number
> > (found by enumerating the supplier) is not really an option.
> My timing numbers for chemfp-1.1 showed about the same performance as
> RDKit's own parser. In newer versions I fixed some of the corner cases,
> and rewrote the code in C for better performance.
>
> > What would be a good solution in my opinion is to create an empty
> > molecule with all sd properties, including _Name, in case of an error
> > instead of None. The actual error could then also be communicated into
> > python via an '_Error' property.
>   ...
> > Maybe this behaviour could be activated via an option and the default
> > would be to return None, to not break any existing code.
> It would have to be via an option, for exactly the reason you highlighted.
> The option might look like:
>    ForwardSDMolSupplier(...., onError=handler)
> The simplest is if "handler" is one of a handful of possible string values:
>   - "None" to return None on failure; the current behavior
>   - "ErrorMol" to return an error molecule like you describe
> Personally, I would love some easy way from the Python API to get access
> to the warning and error messages without having to intercept the log
> messages. I think that something like this is the way to get there.
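
To make the proposal concrete, here is a minimal pure-Python sketch of the two behaviours. Nothing in it is an existing RDKit API: parse_records, its on_error argument, and the placeholder dictionary are all hypothetical, and the parser is passed in as a plain callable so the example runs without RDKit. In a real setting the placeholder would presumably be an empty molecule with "_Name" and "_Error" set as properties.

```python
def parse_records(records, parse, on_error="None"):
    """Yield one result per SD record text (hypothetical sketch).

    parse    -- callable turning record text into a molecule; it may return
                None or raise an exception on failure
    on_error -- "None" yields None for a failed record;
                "ErrorMol" yields a placeholder carrying _Name and _Error
    """
    # Note: as this is a generator, the check runs on first iteration.
    if on_error not in ("None", "ErrorMol"):
        raise ValueError("on_error must be 'None' or 'ErrorMol'")
    for record in records:
        error = None
        try:
            mol = parse(record)
        except Exception as exc:
            mol, error = None, str(exc)
        if mol is None and error is None:
            error = "parser returned None"
        if mol is not None or on_error == "None":
            yield mol
        else:
            # The first line of an SD record is the molecule title.
            yield {"_Name": record.split("\n", 1)[0], "_Error": error}


def toy_parse(record):
    # Stand-in parser for the demo: rejects any record mentioning atom "X".
    if " X " in record:
        raise ValueError("unknown atom X")
    return record.split("\n", 1)[0].upper()


# Made-up records; real ones would come from an SD file.
records = ["good\n C \nM  END\n$$$$\n", "bad\n X \nM  END\n$$$$\n"]
default_results = list(parse_records(records, toy_parse))
error_results = list(parse_records(records, toy_parse, on_error="ErrorMol"))
print(default_results)
print(error_results)
```

Keeping "None" as the default leaves existing code untouched, while "ErrorMol" lets a caller report which record failed and why.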

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
