Hi all, I’m just going to post this in case it’s useful to anyone else - and if not, at least I’ll be able to google it for later reference.
It was more of a python learning exercise for me than anything else - I’m sure there is probably a better and easier way to do this - but perhaps it will be useful to someone. This is a script that is useful in a fairly specific situation - when you have a molecular replacement solution of a large, multi protein complex and you want to mutate each chain in the search model to match the target sequence. It reads a file (seq_name) containing the target sequences in fasta format (with an additional “>” line after the last sequence, order of the sequences is not important), matches them each to the best aligning chain in the molecule (mol_id), and aligns and mutates the chain before finally associating each target sequence with the appropriate chain. At the moment it only looks for one chain matching each target sequence, but it would be easy enough to modify it to take account of multiple identical copies. Unfortunately at the moment it spawns one info-dialog per align_and_mutate job, because I don’t know how to automatically dismiss them from within the script. It uses a couple of modules from Biopython, so Biopython will need to be installed and accessible to coot for this script to work (place Bio and BioSQL directories inside the site-packages subdir of Coot’s python installation and modify the PYTHONPATH variable in /usr/local/bin/coot accordingly). It will also only work with coot nightlies r4872 or later. Incidentally, biopython seems to play pretty happily with coot, no negative effects that I’ve noted so far, and it has a few useful tools, particularly for dealing with sequences (e.g. being able to directly blast the sequence of a PDB and return the sequence of the top hit). Any comments/criticisms much appreciated - am trying to teach myself python but am starting from scratch pretty much, so if there is any obviously stupid code in this script I’d love to know how to make it better! Cheers, Oliver.
#Script to automatically match, align and mutate multiple chains in a PDB file to a series of FASTA-formatted sequences in arbitrary order. #For this script to work, you need Biopython installed and accessible to COOT - this may require altering $PYTHONPATH in the coot startup script. import re from Bio import pairwise2 from Bio.SubsMat import MatrixInfo as matlist #Use BLOSUM62 substitution matrix for test alignments. matrix=matlist.blosum62 #Gap open/extend penalties for sequence alignment (used during the Biopython alignment and the align_and_mutate() steps). gap_open=-11 gap_extend=-1 #Alter the penalties from the COOT defaults. set_alignment_gap_and_space_penalty(gap_open,gap_extend) #Give the name of your sequence file containing multiple FASTA sequences. seq_name="Your_sequences.fasta" #Give the molecule ID to be aligned/mutated. mol_id=0 def get_seqs_from_fasta(fasta_file): #Extract each fasta-formatted sequence from a text file (there must be a line with ">" after the last sequence - example follows ) #>SEQ1 #AAAAAAAAA #>LAST_SEQ #AAAAAAAAA #> with open(fasta_file) as fp: i=0 raw_seqs_from_file={} for result in re.findall('(>(.*?\n))(.*?)(>(.*?\n))', fp.read(), re.S): raw_seqs_from_file[i]=str(result[2]).replace("\n","").upper() i=i+1 return raw_seqs_from_file def get_seqs_from_pdb_chain(mol_id,ch_id): #Make a dictionary to translate 3-letter aa-codes to 1 letter aa-codes amino_acid_dic={'ALA':'A','ARG':'R','ASN':'N','ASP':'D','CYS':'C','GLU':'E','GLN':'Q','GLY':'G','HIS':'H','ILE':'I','LEU':'L','LYS':'K','MET':'M','MSE':'M','PHE':'F','PRO':'P','SER':'S','THR':'T','TRP':'W','TYR':'Y','VAL':'V','UNK':'X'} sn=0 res=resname_from_serial_number(mol_id,ch_id,sn) seq='' last_sn=chain_n_residues(ch_id,mol_id)-1 #Get sequence from PDB file. This ignores gaps. while sn<=last_sn: res=resname_from_serial_number(mol_id,ch_id,sn) aa=str(amino_acid_dic.get(res,'')) seq=seq+aa sn=sn+1 result=seq #Return sequence. return result raw_seqs_from_file=get_seqs_from_fasta(seq_name) ch_id_seq_dic={} #Create a dictionary with chain ids and matched sequences. for ch_id in chain_ids(mol_id): ch_id_seq_dic[ch_id]=get_seqs_from_pdb_chain(mol_id,ch_id) ch_id_match_dic={} #Find the best-aligning chain for each target sequence. for key in sorted(raw_seqs_from_file): target_seq=raw_seqs_from_file[key] score_dic={} for ch_id in sorted(ch_id_seq_dic): model_seq=ch_id_seq_dic[ch_id] score=pairwise2.align.globalds(target_seq, model_seq, matrix, gap_open, gap_extend, score_only=1, one_alignment_only=1) score_dic[ch_id]=score ch_id_best=max(score_dic,key=score_dic.get) print key,ch_id_best,score_dic[ch_id_best] ch_id_match_dic[str(ch_id_best)]=raw_seqs_from_file[key] #Mutate each matched chain to the target sequence. Spawns too many info-dialogs, but I don't know how to dismiss them. for ch_id in sorted(ch_id_match_dic): fasta_seq=fasta_seq=">\n%s"%str(ch_id_match_dic[ch_id]) align_and_mutate(mol_id,ch_id,fasta_seq,0) assign_fasta_sequence(mol_id,ch_id,fasta_seq)