On Apr 14, 2022, at 09:16, Gyro Funch <gyromagne...@gmail.com> wrote:
> I don't know the sdf format well, so please excuse my ignorance, but instead 
> of a custom parser, would it be possible to write a preprocessor to eliminate 
> the offending information? Perhaps something using regular expressions in 
> python, perl, sed, or awk?

The SDF format is too complicated to be parsed with a regular expression[1], 
and the failure modes often cannot be detected at the syntax level[2]. One 
option worth considering is chemfp[3].

[1] For example, in a V2000-formatted record, the number of atom records and 
the number of bond records are given by counts on the counts line. A 
traditional/formal regular expression does not support a repeat count whose 
value comes from the text being matched.

Most regular expression engines have more powerful capabilities than formal 
regular expressions, such as back-references to captured groups. However, 
few support using a back-reference as a repeat count.

I wrote one that did, which would let you specify

(?P<atom_count>...)(?P<bond_count>...) and so on
(?P<atom>(?P<atom_x>.{10})(?P<atom_y>.{10}) and so on){atom_count}
(?P<bond>(?P<from_atom>.{3})(?P<to_atom>.{3}) and so on){bond_count}

but in practice, defining the SDF grammar as a regular expression was 
decidedly not easy!
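
To make the count dependence concrete, here is a minimal sketch in plain 
Python (not chemfp's parser, and not the regular expression engine mentioned 
above). It reads the atom and bond counts from the fixed-width V2000 counts 
line and then slices out that many lines. The function name and the tiny 
two-atom example record are made up for illustration, and the sketch ignores 
V3000 records, property lines, and most error cases.

```python
def split_v2000_blocks(molblock):
    """Return (atom_lines, bond_lines) for a V2000 molblock.

    Illustrative helper: the counts come from the data itself, which is
    the step a formal regular expression cannot express.
    """
    lines = molblock.splitlines()
    if len(lines) < 4:
        raise ValueError("missing header or counts line")
    counts_line = lines[3]  # the three header lines come first
    if "V2000" not in counts_line:
        raise ValueError("not a V2000 counts line")

    # Fixed-width fields: columns 1-3 are the atom count, 4-6 the bond count.
    num_atoms = int(counts_line[0:3])
    num_bonds = int(counts_line[3:6])

    atom_start = 4
    bond_start = atom_start + num_atoms
    atom_lines = lines[atom_start:bond_start]
    bond_lines = lines[bond_start:bond_start + num_bonds]
    if len(atom_lines) != num_atoms or len(bond_lines) != num_bonds:
        raise ValueError("record is shorter than its counts line claims")
    return atom_lines, bond_lines


# A made-up two-atom, one-bond record, used only to exercise the function.
EXAMPLE = """\
example
  sketch

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
"""

atom_lines, bond_lines = split_v2000_blocks(EXAMPLE)
for line in atom_lines:
    # x is in columns 1-10, the element symbol in columns 32-34.
    print(line[31:34].strip(), float(line[0:10]))
```

Even this much only handles the layout. It says nothing about whether the 
chemistry in the record is acceptable to a toolkit, which is the point of [2].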

I've wanted to experiment with using WUFFS to make a low-level SDF parser 
library, see
  https://github.com/google/wuffs


[2] For example, RDKit by default rejects atoms where the valence is too high. 
Detecting this in filter code calls for reverse-engineering what RDKit already 
does.

[3] Chemfp is best known as a fingerprint generation and search program. 
However, there are a few use cases where I wanted to have access to the input 
record (e.g., to detect toolkit failures, or to add fingerprint data to the 
input record rather than round-tripping the SDF through a toolkit). I did this 
by writing my own SDF record reader (in the "text_toolkit"), writing a wrapper 
to the RDKit toolkit (in the "rdkit_toolkit"), and using an error handler which 
can decide how to handle errors (ignore, report, raise an exception, log, 
etc.). That error handler has access to location information, which includes 
the record number, the record text, the line number of the start of the record, 
and more.

Here's what it looks like for Giovanni's use case:


from chemfp import rdkit_toolkit as T
from chemfp import text_toolkit

filename = "/Users/dalke/databases/ChEBI_complete_3star.sdf.gz"


class ErrorHandler:
    def __init__(self):
        self.error_ids = []
        
    def error(self, msg, location, extra=None):
        record = location.record
        chebi_id = text_toolkit.get_sdf_tag(record, "ChEBI ID")
        print(f"!!! Error reading record {location.recno} with ID: {chebi_id!r}")
        print(f"    at {location.where()}")
        self.error_ids.append(chebi_id)

errors = ErrorHandler()
count = 0
num_atoms = 0
for mol in T.read_molecules(filename, errors=errors):
    count += 1
    num_atoms += mol.GetNumAtoms()  # This is a RDMol.

print(f"Parsed {count} records ({num_atoms} atoms), skipped {len(errors.error_ids)}.")
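
For readers who want to see the kind of bookkeeping involved, the following 
is a rough sketch in plain Python of a record reader that tracks the record 
number and starting line number, plus a simple tag lookup. It is not chemfp's 
implementation: RecordLocation, iter_sdf_records, and get_tag are hypothetical 
names, the sample data is made up, and edge cases that a production reader 
must handle (for example, unusual data header lines) are ignored.

```python
import io
from dataclasses import dataclass


@dataclass
class RecordLocation:
    recno: int    # 1-based record number
    lineno: int   # 1-based line number where the record starts
    record: str   # full record text, including the "$$$$" line


def iter_sdf_records(lines):
    """Yield a RecordLocation for each "$$$$"-terminated record."""
    recno = 0
    start_lineno = 1
    buffer = []
    for lineno, line in enumerate(lines, 1):
        buffer.append(line)
        if line.rstrip("\r\n") == "$$$$":
            recno += 1
            yield RecordLocation(recno, start_lineno, "".join(buffer))
            buffer = []
            start_lineno = lineno + 1
    # Report trailing text that has no terminator, unless it is blank.
    if any(line.strip() for line in buffer):
        recno += 1
        yield RecordLocation(recno, start_lineno, "".join(buffer))


def get_tag(record, tag):
    """Return the first data line for the named tag, or None if absent."""
    lines = record.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(">") and f"<{tag}>" in line and i + 1 < len(lines):
            return lines[i + 1]
    return None


# Two made-up, empty-molecule records, used only to exercise the reader.
SAMPLE = (
    "first\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n"
    "> <ChEBI ID>\nID-A\n\n$$$$\n"
    "second\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n"
    "> <ChEBI ID>\nID-B\n\n$$$$\n"
)

for loc in iter_sdf_records(io.StringIO(SAMPLE)):
    print(loc.recno, loc.lineno, get_tag(loc.record, "ChEBI ID"))
```

For a gzip-compressed file, passing the file object from 
gzip.open(filename, "rt") in place of the StringIO should work the same way. 
Each record's text could then be handed to a toolkit, with the record number 
and ID reported whenever the toolkit rejects it.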


This functionality is available in the pre-compiled version of chemfp for 
Linux-based OSes, which can be downloaded from https://chemfp.com/download/ . 
The default license agreement (that is, you can use it without a license key) 
lets you use it for any internal purpose.

If anyone is interested in working on a stand-alone SDF parsing library under a 
free software license, I can provide some pointers and feedback, and will 
contribute chemfp's SDF parser under the MIT license.


                                Andrew
                                da...@dalkescientific.com




_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
