Hi Tim,
You might also consider using chemfp, which has this sort of functionality
available through its toolkit wrapper API:
from chemfp import rdkit_toolkit as T
import itertools

with T.read_ids_and_molecules("chembl_28.sdf.gz") as reader:
    loc = reader.location
    for id, mol in itertools.islice(reader, 5):
        print(f"Record: {loc.recno} ({id}) line: {loc.lineno} offsets: {loc.offsets}")
        counts_line = loc.record.splitlines()[3]
        num_atoms, num_bonds = int(counts_line[:3]), int(counts_line[3:6])
        print(f" counts line #atoms: {num_atoms} #bonds: {num_bonds}")
        print(f" RDKit #atoms: {mol.GetNumAtoms()} #bonds: {mol.GetNumBonds()}")
The output in this case is:
Record: 1 (CHEMBL153534) line: 1 offsets: (0, 1458)
 counts line #atoms: 16 #bonds: 17
 RDKit #atoms: 16 #bonds: 17
Record: 2 (CHEMBL440060) line: 43 offsets: (1458, 18699)
 counts line #atoms: 206 #bonds: 208
 RDKit #atoms: 202 #bonds: 204
Record: 3 (CHEMBL440245) line: 466 offsets: (18699, 39688)
 counts line #atoms: 251 #bonds: 254
 RDKit #atoms: 251 #bonds: 254
Record: 4 (CHEMBL440249) line: 980 offsets: (39688, 56050)
 counts line #atoms: 194 #bonds: 205
 RDKit #atoms: 185 #bonds: 196
Record: 5 (CHEMBL405398) line: 1388 offsets: (56050, 58447)
 counts line #atoms: 27 #bonds: 30
 RDKit #atoms: 27 #bonds: 30
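In case it helps to see the counts-line step on its own, without any toolkit: in a V2000 molfile the counts line is the fourth line of the record, and the atom and bond counts are its first two 3-character fixed-width fields. Here is a minimal sketch of that. The function name parse_counts_line and the tiny water record are my own, made up for illustration, and are not part of chemfp or RDKit. It also assumes V2000 records only, since V3000 records keep their counts on a separate "M  V30 COUNTS" line.

```python
def parse_counts_line(record):
    """Return (num_atoms, num_bonds) from a V2000 molfile/SDF record string.

    Hypothetical helper for illustration; assumes a V2000 record.
    """
    lines = record.splitlines()
    if len(lines) < 4:
        raise ValueError("record is too short to contain a counts line")
    counts_line = lines[3]
    # Characters 0-2 hold the atom count, characters 3-5 the bond count.
    return int(counts_line[:3]), int(counts_line[3:6])


# Hand-written example record (water with explicit hydrogens), not real data.
EXAMPLE_RECORD = "\n".join([
    "water",                                     # line 1: title
    "  hand-written example",                    # line 2: program line
    "",                                          # line 3: comment
    "  3  2  0  0  0  0  0  0  0  0999 V2000",   # line 4: counts line
    "    0.0000    0.0000    0.0000 O   0  0",
    "    0.9572    0.0000    0.0000 H   0  0",
    "   -0.2400    0.9266    0.0000 H   0  0",
    "  1  2  1  0",
    "  1  3  1  0",
    "M  END",
    "$$$$",
]) + "\n"

print(parse_counts_line(EXAMPLE_RECORD))  # prints (3, 2)
```

The counts line describes what is written in the file, so it can differ from what a toolkit reports after parsing, as in records 2 and 4 above.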
You can also work more directly at the record tokenization level and pass each
record to the rdkit_toolkit wrapper:
import itertools
from chemfp import rdkit_toolkit as T
from chemfp import text_toolkit

with text_toolkit.read_sdf_records("chembl_28.sdf.gz") as reader:
    for rec in itertools.islice(reader, 5):
        mol = T.parse_molecule(rec, "sdf")
        print(mol.GetProp("chembl_id"), "has", len(rec), "bytes")
which prints
CHEMBL153534 has 1458 bytes
CHEMBL440060 has 17241 bytes
CHEMBL440245 has 20989 bytes
CHEMBL440249 has 16362 bytes
CHEMBL405398 has 2397 bytes
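If you would rather not add a dependency, the same idea (split on the "$$$$" delimiter line and keep track of byte offsets) can be sketched with only the standard library. This is a rough illustration and not how chemfp does it: iter_sdf_records and the two toy records are names and data I made up for the example, and the sketch silently drops any trailing text that is not terminated by "$$$$".

```python
import gzip
import io


def iter_sdf_records(fileobj):
    """Yield (recno, start, end, record_bytes) for each SDF record.

    Hypothetical helper; `fileobj` must be opened in binary mode.
    Offsets are byte positions in the uncompressed stream.
    """
    recno = 0
    start = 0
    chunks = []
    for line in fileobj:
        chunks.append(line)
        # A record ends at a line containing only "$$$$".
        if line.rstrip(b"\r\n") == b"$$$$":
            record = b"".join(chunks)
            recno += 1
            end = start + len(record)
            yield recno, start, end, record
            start = end
            chunks = []
    # Anything left in `chunks` here lacked a "$$$$" terminator and is ignored.


# Two toy records, gzip-compressed in memory so the example needs no files.
REC1 = b"mol1\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
REC2 = (b"mol2\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n"
        b"> <note>\nhello\n\n$$$$\n")
compressed = gzip.compress(REC1 + REC2)

with gzip.open(io.BytesIO(compressed), "rb") as f:
    RECORDS = list(iter_sdf_records(f))

for recno, start, end, record in RECORDS:
    title = record.split(b"\n", 1)[0].decode("utf-8")
    print(f"Record: {recno} ({title}) offsets: ({start}, {end}) bytes: {len(record)}")
```

Each record's end offset is the next record's start offset, which is the same relationship you can see between the offsets and byte counts in the chemfp output above.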
Andrew
> On Nov 4, 2021, at 17:55, Tim Dudgeon <[email protected]> wrote:
>
> Thanks Paolo, that's fantastic.
> The first option was what I needed.
> Tim
>
> On Thu, Nov 4, 2021 at 4:36 PM Paolo Tosco <[email protected]> wrote:
> Hi Tim,
>
> if you need access to the original text, you'll have to do the chunking
> yourself, e.g.:
>
> import gzip
>
> def molgen(hnd):
>     mol_text_tmp = ""
>     while 1:
>         line = hnd.readline()
>         if not line:
>             return
>         line = line.decode("utf-8")
>         mol_text_tmp += line
>         if line.startswith("$$$$"):
>             mol_text = mol_text_tmp
>             mol_text_tmp = ""
>             yield mol_text
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss