Hi all,

I’m just going to post this in case it’s useful to anyone else - and if not, at 
least I’ll be able to google it for later reference.

It was more of a python learning exercise for me than anything else - I’m sure 
there is probably a better and easier way to do this - but perhaps it will be 
useful to someone.

This is a script that is useful in a fairly specific situation - when you have 
a molecular replacement solution of a large, multi protein complex and you want 
to mutate each chain in the search model to match the target sequence.

It reads a file (seq_name) containing the target sequences in fasta format 
(with an additional “>” line after the last sequence, order of the sequences is 
not important), matches them each to the best aligning chain in the molecule 
(mol_id), and aligns and mutates the chain before finally associating each 
target sequence with the appropriate chain. At the moment it only looks for one 
chain matching each target sequence, but it would be easy enough to modify it 
to take account of multiple identical copies. Unfortunately at the moment it 
spawns one info-dialog per align_and_mutate job, because I don’t know how to 
automatically dismiss them from within the script.

It uses a couple of modules from Biopython, so Biopython will need to be 
installed and accessible to coot for this script to work (place Bio and BioSQL 
directories inside the site-packages subdir of Coot’s python installation and 
modify the PYTHONPATH variable in /usr/local/bin/coot accordingly). It will 
also only work with coot nightlies r4872 or later. 

Incidentally, biopython seems to play pretty happily with coot, no negative 
effects that I’ve noted so far, and it has a few useful tools, particularly for 
dealing with sequences (e.g. being able to directly blast the sequence of a PDB 
and return the sequence of the top hit).

Any comments/criticisms much appreciated - am trying to teach myself python but 
am starting from scratch pretty much, so if there is any obviously stupid code 
in this script I’d love to know how to make it better!

Cheers,
Oliver.

#Script to automatically match, align and mutate multiple chains in a PDB file to a series of FASTA-formatted sequences in arbitrary order.
#For this script to work, you need Biopython installed and accessible to COOT - this may require altering $PYTHONPATH in the coot startup script.
import re
from Bio import pairwise2
from Bio.SubsMat import MatrixInfo as matlist
#Use BLOSUM62 substitution matrix for test alignments.
matrix=matlist.blosum62
#Gap open/extend penalties for sequence alignment (used during the Biopython alignment and the align_and_mutate() steps).
gap_open=-11
gap_extend=-1 
#Alter the penalties from the COOT defaults.
set_alignment_gap_and_space_penalty(gap_open,gap_extend)
#Give the name of your sequence file containing multiple FASTA sequences.
seq_name="Your_sequences.fasta"
#Give the molecule ID to be aligned/mutated.
mol_id=0
def get_seqs_from_fasta(fasta_file):
	#Extract each fasta-formatted sequence from a text file (there must be a line with ">" after the last sequence - example follows )
	#>SEQ1
	#AAAAAAAAA
	#>LAST_SEQ
	#AAAAAAAAA
	#>
	with open(fasta_file) as fp:
		i=0
		raw_seqs_from_file={}
		for result in re.findall('(>(.*?\n))(.*?)(>(.*?\n))', fp.read(), re.S):
			raw_seqs_from_file[i]=str(result[2]).replace("\n","").upper()
			i=i+1
		return raw_seqs_from_file
def get_seqs_from_pdb_chain(mol_id,ch_id):
        #Make a dictionary to translate 3-letter aa-codes to 1 letter aa-codes
	amino_acid_dic={'ALA':'A','ARG':'R','ASN':'N','ASP':'D','CYS':'C','GLU':'E','GLN':'Q','GLY':'G','HIS':'H','ILE':'I','LEU':'L','LYS':'K','MET':'M','MSE':'M','PHE':'F','PRO':'P','SER':'S','THR':'T','TRP':'W','TYR':'Y','VAL':'V','UNK':'X'}
        sn=0
        res=resname_from_serial_number(mol_id,ch_id,sn)
        seq=''
        last_sn=chain_n_residues(ch_id,mol_id)-1
	#Get sequence from PDB file. This ignores gaps.
        while sn<=last_sn:
                res=resname_from_serial_number(mol_id,ch_id,sn)
                aa=str(amino_acid_dic.get(res,''))
		seq=seq+aa
                sn=sn+1
        result=seq
	#Return sequence.
        return result
raw_seqs_from_file=get_seqs_from_fasta(seq_name)
ch_id_seq_dic={}
#Create a dictionary with chain ids and matched sequences.
for ch_id in chain_ids(mol_id):
	ch_id_seq_dic[ch_id]=get_seqs_from_pdb_chain(mol_id,ch_id)
ch_id_match_dic={}
#Find the best-aligning chain for each target sequence. 
for key in sorted(raw_seqs_from_file):
	target_seq=raw_seqs_from_file[key]
	score_dic={}
	for ch_id in sorted(ch_id_seq_dic):
		model_seq=ch_id_seq_dic[ch_id]
		score=pairwise2.align.globalds(target_seq, model_seq, matrix, gap_open, gap_extend, score_only=1, one_alignment_only=1)
		score_dic[ch_id]=score
	ch_id_best=max(score_dic,key=score_dic.get)
	print key,ch_id_best,score_dic[ch_id_best]
	ch_id_match_dic[str(ch_id_best)]=raw_seqs_from_file[key]
#Mutate each matched chain to the target sequence. Spawns too many info-dialogs, but I don't know how to dismiss them.
for ch_id in sorted(ch_id_match_dic):
	fasta_seq=fasta_seq=">\n%s"%str(ch_id_match_dic[ch_id])
	align_and_mutate(mol_id,ch_id,fasta_seq,0)
	assign_fasta_sequence(mol_id,ch_id,fasta_seq)


Reply via email to