Thank you Ivan, that's great!
It does exactly what I wanted.
OK, I cannot read the properties for wrong records, but one could argue when
the molecule is wrong there is no point reading the properties.
I only adjusted a couple of details (made sure that the file is read as text,
not binary, otherwise no valid record is found; included the handling of
gzipped files; added a writer to store the wrong records as they are read).
Something like this:
import gzip
import rdkit
from rdkit import Chem
import pickle
def read_record(fh):
lines = []
for line in fh:
lines.append(line)
if line.rstrip() == '$$$$':
return ''.join(lines)
def read_records(fh):
while True:
rec = read_record(fh)
if rec is None:
return
yield rec
sup = Chem.SDMolSupplier()
d = dict()
mols_pickled = []
i = 0
with gzip.open('x.sdf.gz', 'rt') as fh, gzip.open('x_wrong.sdf.gz', 'wt') as
fh_wrong:
for rec in read_records(fh):
sup.SetData(rec)
mol = next(sup)
if mol is None:
fh_wrong.write(rec)
else:
d[i] = mol.GetPropsAsDict()
mols_pickled.append(pickle.dumps(mol))
i += 1
After running this, there is a file with the wrong records (if any), the
correct molecules are pickled and stored in a list, and their data are in a
dictionary of dictionaries that can be converted for instance to a DataFrame.
I take your point of caution regarding whether it is safe or not to split
records by detecting a line exactly equal to string '$$$$'.
Not sure how much of a practical problem it really is.
I can report that I've just ran the above script on an SDF with 1.5 M molecules
coming from several sources, and it completed in 5 minutes without stumbling.
But yes, if someone is dead set on making this go wrong, it is possible, by
just defining a property with exact value '$$$$' and writing it out to an SDF.
I've yet to come across such an occurrence, after several years of working with
SDF's. So I'm not overly worried.
Thanks again for your input!
Giovanni
From: Ivan Tubert-Brohman <[email protected]>
Sent: 14 April 2022 12:58
To: Giovanni Tricarico <[email protected]>
Cc: [email protected]
Subject: Re: [Rdkit-discuss] how to report SDF records for which
Chem.ForwardSDMolSupplier returns None?
You don't often get email from
[email protected]<mailto:[email protected]>.
Learn why this is important<http://aka.ms/LearnAboutSenderIdentification>
How about splitting the file on lines consisting of "$$$$", and then parsing
each record? If the parsing fails, you can write out the bad record for future
inspection. (This addresses the basic use case, but not the "even better" one.)
Here's a proof of concept:
from rdkit import Chem
def read_record(fh):
lines = []
for line in fh:
lines.append(line)
if line.rstrip() == '$$$$':
return ''.join(lines)
def read_records(fh):
while True:
rec = read_record(fh)
if rec is None:
return
yield rec
sup = Chem.SDMolSupplier()
with open('x.sdf') as fh:
for rec in read_records(fh):
sup.SetData(rec)
mol = next(sup)
if mol is None:
print("Bad record:\n", rec)
continue
print(mol.GetPropsAsDict())
I worry that this is not strictly correct, because what if the value of a
property happens to be "$$$$"? But apparently RDKit's own SDMolSupplier is also
confused by this (or maybe such values are forbidden by the file format and/or
there's some escape mechanism? I haven't checked), so I don't feel nearly as
bad about that.
Ivan
On Wed, Apr 13, 2022 at 4:29 PM Giovanni Tricarico
<[email protected]<mailto:[email protected]>> wrote:
Hello,
I am using rdkit to read data from SD files.
My goal is to extract both the molecules and their associated properties (which
for our purposes are separate entities) from the SDF.
[For 100% clarity: by 'properties' I don't mean calculated properties or atom
or bond properties, but the text properties that were saved in the SDF with
each molecule, i.e. those that you get when you do mol.GetPropsAsDict() ].
After several tests I found that Chem.ForwardSDMolSupplier does what I need.
But there is an issue.
When Chem.ForwardSDMolSupplier decides that a molecule is not OK, i.e. when it
says it is None, the SDF record is lost:
I cannot access its Props; I cannot save the failed SDF record for later
inspection.
[Or at least, I don't know how to do it, hence this question].
At most I can collect the indices of the records that fail.
> Would anyone be able to suggest how to save to a text file (which an SDF
> essentially already is) the SDF records for which Chem.ForwardSDMolSupplier
> returns a None?
> Even better, could the properties associated to the failed molecules be read
> independently? In theory the properties are in a separate part of the CTAB,
> so even when the atoms, bonds, etc. have a problem, the properties might
> still be OK.
(Note: PandasTools.LoadSDF gives the same issue, it does not even store in the
DataFrame the records for which the molecule is None, and in any case it cannot
be used with the kind of SDF's I am handling, as it uses an enormous amount of
memory for the molecules - hence the decision to use Chem.ForwardSDMolSupplier
and pickle the molecules as soon as they are read).
Thanks
This e-mail and its attachment(s) (if any) may contain confidential and/or
proprietary information and is intended for its addressee(s) only. Any
unauthorized use of the information contained herein (including, but not
limited to, alteration, reproduction, communication, distribution or any other
form of dissemination) is strictly prohibited. If you are not the intended
addressee, please notify the originator promptly and delete this e-mail and its
attachment(s) (if any) subsequently. Neither Galapagos nor any of its
affiliates shall be liable for direct, special, indirect or consequential
damages arising from alteration of the contents of this message (by a third
party) or as a result of a virus being passed on.
_______________________________________________
Rdkit-discuss mailing list
[email protected]<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss<https://eur05.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.sourceforge.net%2Flists%2Flistinfo%2Frdkit-discuss&data=04%7C01%7C%7C330ceb2f01a44258904108da1e05aa97%7C627f3c33bccc48bba033c0a6521f7642%7C1%7C0%7C637855306937710558%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=9gDgX7vRv%2F0L%2BMKtf38d%2BonQplsDc15VezxXSvWiMq4%3D&reserved=0>
This e-mail and its attachment(s) (if any) may contain confidential and/or
proprietary information and is intended for its addressee(s) only. Any
unauthorized use of the information contained herein (including, but not
limited to, alteration, reproduction, communication, distribution or any other
form of dissemination) is strictly prohibited. If you are not the intended
addressee, please notify the originator promptly and delete this e-mail and its
attachment(s) (if any) subsequently. Neither Galapagos nor any of its
affiliates shall be liable for direct, special, indirect or consequential
damages arising from alteration of the contents of this message (by a third
party) or as a result of a virus being passed on.
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss