How about splitting the file on lines consisting of "$$$$", and then
parsing each record? If the parsing fails, you can write out the bad record
for future inspection. (This addresses the basic use case, but not the
"even better" one.)

Here's a proof of concept:

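from rdkit import Chem

def read_record(fh):
    """Return the next record, including its '$$$$' line, or None at end of file."""
    lines = []
    for line in fh:
        lines.append(line)
        if line.rstrip() == '$$$$':
            return ''.join(lines)
    # End of file: any text after the last '$$$$' is dropped.
    return None

def read_records(fh):
    """Yield the records of an open SD file one at a time, as raw text."""
    while True:
        rec = read_record(fh)
        if rec is None:
            return
        yield rec

# Reuse a single supplier and feed it one record at a time.
sup = Chem.SDMolSupplier()
with open('x.sdf') as fh:
    for rec in read_records(fh):
        sup.SetData(rec)
        mol = next(sup)
        if mol is None:
            # The raw text is still in hand here, so it can be saved.
            print("Bad record:\n" + rec)
            continue
        print(mol.GetPropsAsDict())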
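
To save the failures to a file rather than print them, the same splitting
idea can be wrapped as below. This is only a sketch: the names iter_records,
save_failures, and parse are made up for illustration, and parse stands for
any function that returns None for a bad record (for example, one that
wraps the SetData/next calls on the supplier).

```python
def iter_records(fh):
    """Yield each '$$$$'-terminated record from an iterable of lines."""
    lines = []
    for line in fh:
        lines.append(line)
        if line.rstrip() == '$$$$':
            yield ''.join(lines)
            lines = []
    # Text after the last '$$$$' (if any) is ignored.

def save_failures(fh, parse, bad_fh):
    """Copy every record for which parse(record) is None to bad_fh.

    parse is a hypothetical callable supplied by the caller.
    Returns the pair (number parsed OK, number failed).
    """
    n_ok = n_bad = 0
    for rec in iter_records(fh):
        if parse(rec) is None:
            bad_fh.write(rec)  # raw text, terminator included
            n_bad += 1
        else:
            n_ok += 1
    return n_ok, n_bad
```

Because each failed record is written out with its "$$$$" line, the output
file is itself a small SD file that can be inspected or re-read later.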

I worry that this is not strictly correct: what if the value of a property
happens to be "$$$$"? But RDKit's own SDMolSupplier is apparently confused
by this too (or maybe such values are forbidden by the file format, or
there is some escape mechanism; I haven't checked), so I don't feel nearly
as bad about it.
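
For the "even better" case, the properties of a failed record could in
principle be pulled out of the raw text without building a molecule. The
sketch below assumes the usual SD layout: the data items come after the
"M  END" line, each starts with a header line beginning with ">" that holds
the field name in angle brackets, and each value ends at a blank line. The
name read_props is made up, and a record with no "M  END" line yields an
empty dict.

```python
import re

# Header line of a data item, e.g. ">  <MW>  (1)"; captures the field name.
_HEADER = re.compile(r'^>.*<([^>]+)>')

def read_props(record):
    """Return {field name: value text} for one raw SD record string."""
    props = {}
    lines = iter(record.splitlines())
    # Skip the molfile part; data items are assumed to follow 'M  END'.
    for line in lines:
        if line.startswith('M  END'):
            break
    name, value = None, []
    for line in lines:
        if line.rstrip() == '$$$$':
            break
        if name is None:
            match = _HEADER.match(line)
            if match:
                name, value = match.group(1), []
        elif line.strip() == '':
            # A blank line closes the current data item.
            props[name] = '\n'.join(value)
            name = None
        else:
            value.append(line)
    if name is not None:
        # Last item was not followed by a blank line.
        props[name] = '\n'.join(value)
    return props
```

Unlike mol.GetPropsAsDict(), this leaves every value as a string, and it
shares the "$$$$" caveat described above.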

Ivan

On Wed, Apr 13, 2022 at 4:29 PM Giovanni Tricarico <
giovanni.tricar...@glpg.com> wrote:

> Hello,
>
> I am using rdkit to read data from SD files.
>
>
>
> My goal is to extract both the molecules and their associated properties
> (which for our purposes are separate entities) from the SDF.
>
> [For 100% clarity: by ‘properties’ I don’t mean calculated properties or
> atom or bond properties, but the text properties that were saved in the SDF
> with each molecule, i.e. those that you get when you do
> mol.GetPropsAsDict() ].
>
>
>
> After several tests I found that Chem.ForwardSDMolSupplier does what I
> need.
>
>
>
> But there is an issue.
>
> When Chem.ForwardSDMolSupplier decides that a molecule is not OK, i.e.
> when it says it is None, the SDF record is lost:
>
> I cannot access its Props; I cannot save the failed SDF record for later
> inspection.
>
> [Or at least, I don’t know how to do it, hence this question].
>
> At most I can collect the indices of the records that fail.
>
>
>
> Would anyone be able to suggest how to save to a text file (which an SDF
> essentially already is) the SDF records for which
> Chem.ForwardSDMolSupplier returns a None?
>
> Even better, could the properties associated with the failed molecules be
> read independently? In theory the properties are in a separate part of the
> CTAB, so even when the atoms, bonds, etc. have a problem, the properties
> might still be OK.
>
>
>
> (Note: PandasTools.LoadSDF gives the same issue, it does not even store
> in the DataFrame the records for which the molecule is None, and in any
> case it cannot be used with the kind of SDF’s I am handling, as it uses an
> enormous amount of memory for the molecules – hence the decision to use
> Chem.ForwardSDMolSupplier and pickle the molecules as soon as they are
> read).
>
>
>
> Thanks
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>