Re: [ccp4bb] PDB data mining

Cale Dakwar Wed, 16 Mar 2011 17:12:08 -0700

Thank you to everyone who replied.  I went through all the suggestions and
in the end used Jason's PyMOL script with Thomas' cpv.distance suggestion
(which did make it much faster for me) and a few more modifications to
eliminate redundant pair listings.


Below is the modified script, saved as dist_set1.py and run in a terminal using
/Applications/PyMOLX11Hybrid.app/Contents/MacOS/MacPyMOL -cq dist_set1.py

It works on all pdb files in the directory /Users/cale/pdb/set1
It reads gzipped pdb files directly, so the files can stay compressed to
keep the folder small
All pdb files must be in the same folder (set1), as it does not traverse
subdirectories
It is limited to at most ~30,000 files in one folder, so I had to split the
~72,000 files mirrored off the PDB into three folders (set1, set2 and set3)
and generate an output file for each individually.  The output files can
then be merged together into one using cat.
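
Two of the notes above (no subdirectory traversal, and merging with cat) can
be sketched in plain Python, outside PyMOL.  This is only a rough sketch, not
part of the script below: the helper names find_pdb_files and merge_outputs
are made up for illustration, and it assumes every output file starts with
the same one-line header, which a plain cat would repeat in the merged file.

```python
import fnmatch
import os


def find_pdb_files(root, pattern="*.ent.gz"):
    """Collect matching files under root, including subdirectories."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def merge_outputs(parts, merged_path):
    """Concatenate output files, keeping only the first file's header line."""
    with open(merged_path, "w") as out:
        for i, part in enumerate(parts):
            with open(part) as src:
                header = src.readline()
                if i == 0:
                    out.write(header)
                # copy the data lines unchanged
                for line in src:
                    out.write(line)
```

For example, merge_outputs(["dist_set1.csv", "dist_set2.csv",
"dist_set3.csv"], "dist_all.csv") would combine the three outputs; the names
other than dist_set1.csv are assumed here.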


##### begin script #####
import glob, os, pymol, sys
from pymol import cmd
from chempy import cpv

the_pdb="/Users/cale/pdb/set1"
files = glob.glob(the_pdb+os.sep+"*.ent.gz")

if not len(files):
   print ("Please set 'the_pdb' variable to a valid path containing PDB "
          "files.")
   sys.exit(1)
else:
   print "Processing %d files." % len(files)

s, outFile = "resn HIS and name ND1", "dist_set1.csv"

f = open(outFile, 'wb')
# write the header
f.write("PDB\tCHAIN\tRESI\tATOM-A\tCHAIN\tRESI\tATOM-B\tDISTANCE\n")
# for each file in the mirror
for x in files:
   cmd.load(x,finish=1)
   n = cmd.get_names()[0]
   m = cmd.get_model(s).atom
   # pairwise for each atom
   for aa in m:
       for bb in m:
           # avoid distances to self
           if aa==bb: continue
           # avoid duplicates
           if aa>bb: continue
           distance = cpv.distance(aa.coord, bb.coord)
           # don't list if distance is above 10 angstroms
           # if distance > 10 : continue
           f.write( "%s\t%s\t%s\t%d\t%s\t%s\t%d\t%f\n" %
                    (n, aa.chain, aa.resi, aa.index,
                        bb.chain, bb.resi, bb.index,
                        distance))
   cmd.delete(n)
f.close()

print "Processed %d files.  Please see %s for results." % (len(files),
outFile)
##### end script #####
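
The "avoid self" and "avoid duplicates" checks in the inner loops amount to
visiting each unordered pair of atoms once.  A minimal sketch of the same
idea with itertools.combinations follows, written in plain Python so it runs
outside PyMOL.  The (label, (x, y, z)) tuples stand in for the PyMOL atom
objects, and pairwise_distances and its cutoff argument are illustrative
names, not part of the script above.

```python
import itertools
import math


def pairwise_distances(atoms, cutoff=None):
    """Yield (label_a, label_b, distance) once per unordered pair.

    atoms is a list of (label, (x, y, z)) tuples; these are stand-ins
    for the atom objects returned by cmd.get_model(...).atom.
    """
    for (la, ca), (lb, cb) in itertools.combinations(atoms, 2):
        d = math.sqrt(sum((p - q) ** 2 for p, q in zip(ca, cb)))
        # optionally skip pairs farther apart than the cutoff
        if cutoff is not None and d > cutoff:
            continue
        yield la, lb, d
```

With n atoms this yields n*(n-1)/2 pairs, and it does not depend on how
the atom objects compare with == or >.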

Cheers,
Cale
