On Apr 14, 2022, at 09:16, Gyro Funch <gyromagne...@gmail.com> wrote:
> I don't know the sdf format well, so please excuse my ignorance, but instead
> of a custom parser, would it be possible to write a preprocessor to eliminate
> the offending information? Perhaps something using regular expressions in
> python, perl, sed, or awk?

The SDF format is too complicated to be parsed with a regular expression [1], and the failure modes often cannot be detected at the syntax level [2]. I suggest people consider using chemfp for this [3].

[1] For example, in a V2000-formatted record, the number of atom records and the number of bond records are given by counts earlier in the record. A traditional/formal regular expression does not support repeat counts where the count comes from the text being matched. Most regular expression engines have more powerful capabilities than formal regular expressions, such as back-references to captured groups. However, few support using a back-reference as a repeat count. I wrote one that did, which would let you specify:

  (?P<atom_count>...)(?P<bond_count>...) and so on
  (?P<atom>(?P<atom_x>.{10})(?P<atom_y>.{10}) and so on){atom_count}
  (?P<bond>(?P<from_atom>.{3})(?P<from_bond>.{3}) and so on){bond_count}

but in practice, defining the grammar through a regular expression was decidedly not easy!

I've wanted to experiment with using WUFFS to make a low-level SDF parser library; see https://github.com/google/wuffs

[2] For example, RDKit by default rejects atoms where the valence is too high. Detecting this in filter code would require reverse-engineering what RDKit already does.

[3] Chemfp is best known as a fingerprint generation and search program. However, there are a few use cases where I wanted to have access to the input record (e.g., to detect toolkit failures, or to add fingerprint data to the input record rather than round-tripping the SDF through a toolkit). I did this by writing my own SDF record reader (in the "text_toolkit"), writing a wrapper to the RDKit toolkit (in the "rdkit_toolkit"), and using an error handler which can decide how to handle errors (ignore, report, raise an exception, log, etc.). That error handler has access to location information, which includes the record number, the record text, the line number of the start of the record, and more.

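To make footnote [1] concrete, here is a minimal standard-library sketch of the count-driven step that a plain regular expression cannot express: read the atom and bond counts from the V2000 counts line, then slice out exactly that many fixed-width lines. It is not chemfp's reader and not RDKit's parser. The function names (iter_sdf_records, parse_v2000_blocks, check_records) and the two-record SAMPLE text are hypothetical, invented for illustration, and the sketch assumes the usual V2000 column layout (3-character counts, 10-character coordinates, element symbol in columns 32-34). It also tracks the record number and starting line number, similar in spirit to the location information described in footnote [3].

```python
import io


def iter_sdf_records(lines):
    """Yield (recno, start_lineno, record_lines) for each '$$$$'-terminated record."""
    recno = 0
    start_lineno = 1
    record = []
    for lineno, line in enumerate(lines, 1):
        line = line.rstrip("\r\n")
        if line.startswith("$$$$"):
            recno += 1
            yield recno, start_lineno, record
            record = []
            start_lineno = lineno + 1
        else:
            record.append(line)


def parse_v2000_blocks(record):
    """Return (atoms, bonds) using the counts line to size each block."""
    if len(record) < 4 or "V2000" not in record[3]:
        raise ValueError("missing V2000 counts line")
    counts_line = record[3]
    # The repeat counts live in the data itself, in two 3-character fields.
    num_atoms = int(counts_line[0:3])
    num_bonds = int(counts_line[3:6])
    atom_lines = record[4:4 + num_atoms]
    bond_lines = record[4 + num_atoms:4 + num_atoms + num_bonds]
    if len(atom_lines) != num_atoms or len(bond_lines) != num_bonds:
        raise ValueError("record is shorter than its counts line claims")
    # Fixed-width fields: x, y, z (10 columns each), a space, then the symbol.
    atoms = [(float(a[0:10]), float(a[10:20]), float(a[20:30]), a[31:34].strip())
             for a in atom_lines]
    # Fixed-width fields: first atom, second atom, bond type (3 columns each).
    bonds = [(int(b[0:3]), int(b[3:6]), int(b[6:9])) for b in bond_lines]
    return atoms, bonds


# Hypothetical two-record input: the second record claims 3 atoms but has 1.
SAMPLE = "\n".join([
    "example-1",
    "  sketch",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  2  0",
    "M  END",
    "$$$$",
    "example-2",
    "  sketch",
    "",
    "  3  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "$$$$",
]) + "\n"


def check_records(text):
    """Return one (recno, start_lineno, status) tuple per record."""
    report = []
    for recno, start_lineno, record in iter_sdf_records(io.StringIO(text)):
        try:
            atoms, bonds = parse_v2000_blocks(record)
        except ValueError as err:
            report.append((recno, start_lineno, f"error: {err}"))
        else:
            report.append((recno, start_lineno,
                           f"{len(atoms)} atoms, {len(bonds)} bonds"))
    return report


for row in check_records(SAMPLE):
    print(row)
```

Note that this only catches structural problems such as a truncated atom block. The chemistry-level failures in footnote [2], such as a too-high valence, would still pass through it unnoticed, which is why a syntax-level preprocessor is not enough on its own.
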
Here's what the chemfp error-handler approach looks like for Giovanni's use case:

    from chemfp import rdkit_toolkit as T
    from chemfp import text_toolkit

    filename = "/Users/dalke/databases/ChEBI_complete_3star.sdf.gz"

    class ErrorHandler:
        def __init__(self):
            self.error_ids = []

        def error(self, msg, location, extra=None):
            # Pull the ID directly from the raw record text, since
            # the toolkit could not turn this record into a molecule.
            record = location.record
            chebi_id = text_toolkit.get_sdf_tag(record, "ChEBI ID")
            print(f"!!! Error reading record {location.recno} with ID: {chebi_id!r}")
            print(f"   at {location.where()}")
            self.error_ids.append(chebi_id)

    errors = ErrorHandler()

    count = 0
    num_atoms = 0
    for mol in T.read_molecules(filename, errors=errors):
        count += 1
        num_atoms += mol.GetNumAtoms()  # This is an RDMol.

    print(f"Parsed {count} records ({num_atoms} atoms), skipped {len(errors.error_ids)}.")

This functionality is available in the pre-compiled version of chemfp for Linux-based OSes, available from https://chemfp.com/download/ . The default license agreement (that is, you can use it without a license key) lets you use it for any internal purpose.

If anyone is interested in working on a stand-alone SDF parsing library under a free software license, I can provide some pointers and feedback, and will contribute chemfp's SDF parser under the MIT license.

Andrew
da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss