On 2022-04-14 08:12 AM, Giovanni Tricarico wrote:
Thank you Nils.

In fact I do want the sanitize + parse to happen, and I do some further checks 
on the molecules, too (ChEMBL pipeline etc).
The issue is that whatever does not pass the initial steps just completely 
disappears and cannot be reported or inspected in any way.

Indeed, making a custom SDF parser would be one option, as an SDF is just text, 
and rigidly 'structured' by its very definition; only, I was hoping someone had 
already written such a parser :)

For now I will just output the indices of the failed records; the user will 
then have to read them in another application for inspection.
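
A minimal sketch of how the raw text of those failed records could be recovered from their indices, using only the standard library. It assumes the indices are 0-based and counted in file order, with one record per `$$$$` delimiter line; whether that matches the supplier's counting in every edge case is an assumption. The names `iter_sdf_records` and `extract_records` are hypothetical helpers, not RDKit functions.

```python
from typing import Iterable, Iterator, Set


def iter_sdf_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield each SDF record as raw text, including its '$$$$' line."""
    buffer = []
    for line in lines:
        buffer.append(line)
        if line.startswith("$$$$"):
            yield "".join(buffer)
            buffer = []
    # Report a trailing record even if its closing '$$$$' is missing.
    if any(line.strip() for line in buffer):
        yield "".join(buffer)


def extract_records(lines: Iterable[str], wanted: Set[int]) -> Iterator[str]:
    """Yield only the records whose 0-based position is in `wanted`."""
    for index, record in enumerate(iter_sdf_records(lines)):
        if index in wanted:
            yield record


# Hypothetical usage (file names and `failed_indices` are placeholders):
# with open("input.sdf") as src, open("failed.sdf", "w") as dst:
#     dst.writelines(extract_records(src, failed_indices))
```

Because the file is streamed line by line, only one record is held in memory at a time, which matters for large SD files.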

Thanks
Giovanni

-----Original Message-----
From: Nils Weskamp <nils.wesk...@gmail.com>
Sent: 13 April 2022 22:55
To: Giovanni Tricarico <giovanni.tricar...@glpg.com>; 
rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] how to report SDF records for which 
Chem.ForwardSDMolSupplier returns None?


Hello Giovanni,

Have you tried using the ForwardSDMolSupplier with sanitize=False and/or 
strictParsing=False?

This should at least reduce the number of cases where molecules are not 
accepted. You would then have to sanitize the structures yourself afterwards 
and handle possible errors explicitly.

If that doesn't solve your problem, I would consider writing my own parser 
that just ignores everything that looks like a CTAB.
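
A parser of that kind could look roughly like the sketch below. It skips the molfile part of a single raw record and reads only the data items, relying on the usual SDF layout: a header line starting with `>` that carries the field name in angle brackets, followed by value lines and a blank line. The function name `read_sdf_properties` is hypothetical, and all values are returned as strings with no type conversion.

```python
import re
from typing import Dict

# Data header lines start with '>' and carry the field name in <...>.
_FIELD_HEADER = re.compile(r"^>.*?<([^>]+)>")


def read_sdf_properties(record: str) -> Dict[str, str]:
    """Read the data items of one raw SDF record, ignoring the CTAB."""
    lines = record.splitlines()
    # Data items normally follow 'M  END'; if that line is missing
    # (broken CTAB), scan the whole record instead.
    start = 0
    for i, line in enumerate(lines):
        if line.startswith("M  END"):
            start = i + 1
            break
    props: Dict[str, str] = {}
    name = None
    values = []
    for line in lines[start:]:
        if line.startswith("$$$$"):
            break
        if name is None:
            header = _FIELD_HEADER.match(line)
            if header:
                name = header.group(1)
                values = []
        elif line.strip() == "":
            # A blank line terminates the current data item.
            props[name] = "\n".join(values)
            name = None
        else:
            values.append(line)
    if name is not None:  # item not terminated by a blank line
        props[name] = "\n".join(values)
    return props
```

Since the CTAB is never interpreted, this works the same way for records whose atoms or bonds are malformed.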

Hope this helps,
Nils

On 13.04.2022 at 18:15, Giovanni Tricarico wrote:
Hello,

I am using rdkit to read data from SD files.

My goal is to extract both the molecules and their associated
properties (which for our purposes are separate entities) from the SDF.

[For 100% clarity: by 'properties' I don't mean calculated properties
or atom or bond properties, but the text properties that were saved in
the SDF with each molecule, i.e. those that you get when you do
mol.GetPropsAsDict() ].

After several tests I found that Chem.ForwardSDMolSupplier does what I need.

But there is an issue.

When Chem.ForwardSDMolSupplier decides that a molecule is not OK, i.e.
when it says it is None, the SDF record is lost:

I cannot access its Props; I cannot save the failed SDF record for
later inspection.

[Or at least, I don't know how to do it, hence this question].

At most I can collect the indices of the records that fail.

  > Would anyone be able to suggest how to save to a text file (which
an SDF essentially already is) the SDF records for which
Chem.ForwardSDMolSupplier returns a None?
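
One possible approach, sketched here under stated assumptions, is to split the file into raw records first and hand only the molfile part of each record to a parser, so the original text is still available when parsing fails. The helper names (`split_records`, `mol_block_of`, `parse_keeping_raw`) are hypothetical, and `parse` is any callable that returns None (or raises) on failure.

```python
from typing import Any, Callable, Iterable, Iterator, Optional, Tuple


def split_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield raw SDF records, each ending with its '$$$$' line."""
    buffer = []
    for line in lines:
        buffer.append(line)
        if line.startswith("$$$$"):
            yield "".join(buffer)
            buffer = []
    if any(line.strip() for line in buffer):
        yield "".join(buffer)


def mol_block_of(record: str) -> str:
    """Return the molfile part of a record, up to and including 'M  END'."""
    kept = []
    for line in record.splitlines(keepends=True):
        kept.append(line)
        if line.startswith("M  END"):
            break
    return "".join(kept)


def parse_keeping_raw(
    lines: Iterable[str], parse: Callable[[str], Optional[Any]]
) -> Iterator[Tuple[int, str, Optional[Any]]]:
    """Yield (index, raw_record, mol_or_None) for every record."""
    for index, record in enumerate(split_records(lines)):
        try:
            mol = parse(mol_block_of(record))
        except Exception:
            mol = None  # treat a raised error like a failed parse
        yield index, record, mol
```

With RDKit installed, `parse` could be `Chem.MolFromMolBlock`, which returns None for a mol block it cannot process. The caller can then write each `raw_record` whose molecule is None to a separate file and pickle the others as they arrive. Note that molecules built this way do not carry the SDF data fields, so those would have to be handled separately.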

  > Even better, could the properties associated with the failed
molecules be read independently? In theory the properties are in a
separate part of the record, after the CTAB, so even when the atoms,
bonds, etc. have a problem, the properties might still be OK.

(Note: PandasTools.LoadSDF has the same issue: it does not even
store in the DataFrame the records for which the molecule is None. In
any case it cannot be used with the kind of SDFs I am handling, as
it uses an enormous amount of memory for the molecules - hence the
decision to use Chem.ForwardSDMolSupplier and pickle the molecules as
soon as they are read).

Thanks



I don't know the SDF format well, so please excuse my ignorance, but instead of a custom parser, would it be possible to write a preprocessor to eliminate the offending information? Perhaps something using regular expressions in Python, Perl, sed, or awk?

Kind regards,
gyro



