Hello! I recently suffered a loss of programming files (and I had been putting off my backups...)
The project I am working on involved the use of programs which I had been assisted with by some of the group. I am rewritten the programs however my results are not what I expected. I would appreciate anyone who could follow through the steps and programs below and tell me if it seems correct, which it does to me. It could be the error is based on a flaw that I had made before. I will give very simplified explanations of the data to help. Starting a file with genetic data about 95 mb A sample of the file format is: >ENSE00001384652.1|ENSG00000166157.5|ENST00000359693.1 assembly=NCBI35|chr=21|strand=reverse|bases 10012801 to 10012624|exons plus upstream and downstream regions for exon GCGGCCGTTCAAGGCAGGGGGCGGGGCGTCTCCGAGCGGCGGGGCCAAGGGAGGGCACAACAGCTGCTACCTGAACAGTTTCTGACCCAACAGTTACCCAGCGCCGGACTCGCTGCGCCCCGGCGGCTCTAGGGACCCCCGGCGCCTACACTTAGCTCCGCGCCCGAGGTGAGCCCAG It is hard to tell due to email formating however from the ">" to the words "regions for exon" is one line, all the GCATs are on one line. ENSExxxxxxx is an exon id tag followed by a ENSGxxxxxgene id tag then a ENSTxxxxxxx transcript id tag followed by information about the location of exon. the letters GCTA are the actually exon sequence. 1 gene can have different versions, these are called transcripts and each transcript contains a number of exons In order to visually understand the data better I made a script to organize it into sections with Gene ID, then Trancript ID followed by the different Exon IDs like so: NEW GENE ENSG00000166157.5 ENST00000359693.1 ENSE00001384652.1 ENSE00001365174.1 ENSE00001372282.1 ENSE00001369818.1 ENSE00001371166.1 ENSE00001369770.1 ENSE00001122873.3 ENSE00001156126.1 ENSE00001156118.1 ENSE00001378322.1 ENSE00001156105.1 ENSE00001156099.1 ENSE00001156092.1 ENSE00001156083.1 ENSE00001156075.1 ENSE00001156069.1 ENSE00001156063.1 ENSE00001156057.1 ENSE00001100323.1 ENSE00001156046.1 ENSE00001126180.1 ENSE00001100365.1 ENSE00001156031.1 ENSE00001231719.5 NEW TRANSCRIPT ENSG00000166157.5 ENST00000298232.4 ENSE00001384652.1 ENSE00001365174.1 ENSE00001372282.1 ENSE00001369818.1 ENSE00001371166.1 ENSE00001369770.1 ENSE00001156126.1 ENSE00001156118.1 ENSE00001378322.1 ENSE00001156105.1 ENSE00001156099.1 ENSE00001156092.1 ENSE00001156083.1 ENSE00001156075.1 ENSE00001156069.1 ENSE00001156063.1 ENSE00001156057.1 ENSE00001100323.1 ENSE00001156046.1 ENSE00001126180.1 ENSE00001100365.1 ENSE00001156031.1 ENSE00001231719.5 NEW TRANSCRIPT ENSG00000166157.5 ENST00000342420.2 ENSE00001384652.1 ENSE00001365174.1 ENSE00001372282.1 ENSE00001369818.1 ENSE00001371166.1 ENSE00001369770.1 ENSE00001156118.1 ENSE00001378322.1 ENSE00001156105.1 ENSE00001156099.1 ENSE00001156092.1 ENSE00001156083.1 ENSE00001156075.1 ENSE00001156069.1 ENSE00001156063.1 ENSE00001156057.1 ENSE00001100323.1 ENSE00001156046.1 ENSE00001126180.1 ENSE00001100365.1 ENSE00001156031.1 ENSE00001231719.5 NEW TRANSCRIPT ENSG00000166157.5 ENST00000339704.2 ENSE00001384652.1 ENSE00001364671.1 ENSE00001387876.1 ENSE00001384389.1 ENSE00001156111.1 ENSE00001156105.1 ENSE00001156099.1 ENSE00001156092.1 ENSE00001156083.1 ENSE00001156075.1 ENSE00001156069.1 ENSE00001156063.1 ENSE00001156057.1 ENSE00001100323.1 ENSE00001156046.1 ENSE00001126180.1 ENSE00001100365.1 ENSE00001156031.1 ENSE00001231719.5 end of gene group NEW GENE ENSG00000187172.4 ENST00000335369.2 ENSE00001428001.1 ENSE00001331994.3 ENSE00001398768.1 ENSE00001429773.1 etc. This was accomplished with the following program I called OrganizeData.py #########################################################3 #execute with python datafile outputfile import re,sys,string #regular expression to pull out gene, transcript and exon ids info=re.compile('^>(ENSE\d+\.\d).+(ENSG\d+\.\d).+(ENST\d+\.\d)') # sample from the file #>>ENSE00001373825.1|ENSG00000187908.1|ENST00000344467.1 file = open( sys.argv[1], 'r' ) #file to read from WriteFile=open(sys.argv[2], 'w') #file to write to #initialize some variables GENEID is the name of the gene, TRANSID is name of the transcript etc sOldGene="" sGENEID="" sOldTrans="" sTRANSID="" sEXONID="" for line in file: if info.match(line): Matched= info.match(line) sEXONID=Matched.group(1) sGENEID=Matched.group(2) sTRANSID=Matched.group(3) if sOldGene==sGENEID: #if current line is the same gene as last line if sOldTrans == sTRANSID: #if current line is same transcript as last line print >> WriteFile, sEXONID, #add new exon to the current transcript line else: print >> WriteFile,"\n\n\tNEW TRANSCRIPT\n", sGENEID, sTRANSID, sEXONID, #new transcript of same gene print 2 #lines down else: print >> WriteFile,"\nend of gene group\n\nNEW GENE\n", sGENEID , sTRANSID, sEXONID, #new gene drop down 4 lines for #readability sOldGene=sGENEID sOldTrans=sTRANSID ########################################################## Within each gene if an exon is present in each transcript it is non informative for my purposes and with help I wrote a script to create two files, a list 1 exonID per line of the exons that are present in each transcript and another that let me look at the data in a different way, namely the gene with the set of redundant exons. THe script is as follows: ##########################################################3 # exon editing script #use set to use calculus style editing import sys, string # #split it up so each exon is checked against the other transcripts for the same gene #if the exon occurs in all transcripts write it to file # #use the format on a line in the file of # NEW GENE # ENSG00000184895.2 ENST00000327563.1 ENSE00001299380.1 # NEW TRANSCRIPT # ENSG00000184895.2 ENST00000341596.1 ENSE00001366589.1 ENSE00001364470.1 # end of gene group # NEW GENE # ENSG00000099715.5 ENST00000333703.3 ENSE00001324247.1 ENSE00001300317.1 ENSE00001317536.1 ENSE00000981568.3 ENSE00000652232.1 ENSE00001091626.2 # NEW TRANSCRIPT # ENSG00000099715.5 ENST00000342434.1 ENSE00001365337.1 ENSE00001387281.1 ENSE00001368624.1 ENSE00001389148.1 ENSE00000981568.3 ENSE00000652232.1 ENSE00001377109.1 #regular expression to pull out gene, transcript and exon ids info=re.compile('^(ENSG\d+\.\d).+(ENST\d+\.\d).+(ENSE\d+\.\d)+') #above is match gene, transcript, then one or more exons #TFILE = open(sys.argv[1], 'r' ) #read the various transcripts from WFILE=open(sys.argv[1], 'w') # file to write 2 careful with 'w' will overwrite old info in file W2FILE=open(sys.argv[2], 'w') #this file will have the names of redundant exons import sets def getintersections(fname='Z:\datasets\h35GroupedDec15b.txt'): exonSets = {} f = open(fname) for line in f: if line.startswith('ENS'): parts = line.split() gene = parts[0] transcript = parts[1] exons = parts[2:] exonSets.setdefault(gene, sets.Set(exons)).intersection(sets.Set(exons)) return exonSets if __name__ == '__main__': es = getintersections('Z:\datasets\h35GroupedDec15b.txt') for k,v in es.items(): print >> WFILE, k,v,"\n\n" for Rexon in v: print >> W2FILE, Rexon #################################################################### The ouput files look like so: First is: ENSG00000160818.4 Set(['ENSE00001054700.1', 'ENSE00001054696.4', 'ENSE00001054703.1', 'ENSE00001377097.1', 'ENSE00001376166.1', 'ENSE00001363386.1', 'ENSE00001054698.3', 'ENSE00001054701.1']) ENSG00000102081.4 Set(['ENSE00000677348.1', 'ENSE00000677356.1', 'ENSE00000677362.1', 'ENSE00000677344.1', 'ENSE00000677352.1', 'ENSE00001256572.2', 'ENSE00001378539.2', 'ENSE00001378572.1', 'ENSE00000677358.1', 'ENSE00001385529.1', 'ENSE00000677354.1', 'ENSE00000677346.1', 'ENSE00000677350.1', 'ENSE00000677366.2', 'ENSE00000677364.1', 'ENSE00000677360.1', 'ENSE00001023649.1']) ENSG00000124159.4 Set(['ENSE00001401662.1', 'ENSE00001366493.2', 'ENSE00000844941.1', 'ENSE00001306994.2', 'ENSE00000844932.1', 'ENSE00000844938.1', 'ENSE00000844930.1', 'ENSE00000906910.1', 'ENSE00000906908.1', 'ENSE00001430846.1', 'ENSE00001224295.6']) etc the gene with the set of what should be exons found in each transcript, ie if there are 4 transcripts the exon must be in all four to be included in the set the second output file should be exons only found in each transcript: ENSE00001054700.1 ENSE00001054696.4 ENSE00001054703.1 ENSE00001377097.1 ENSE00001376166.1 ENSE00001363386.1 ENSE00001054698.3 ENSE00001054701.1 etc. First problem. When I look at the dataset from which these two outputs are created the exons are not present in all transcripts. When I look at the first gene listed in the first output file the first exon ENSE00001054700.1 is present in both the transcripts listed in Z:\datasets\h35GroupedDec15b.txt however the second exon in that list (ENSE00001054696.4) is only found in the first of the two transcripts. as below: NEW GENE ENSG00000160818.4 ENST00000292338.3 ENSE00001363386.1 ENSE00001054698.3 ***ENSE00001054700.1*** ENSE00001054701.1 ENSE00001054703.1 ENSE00001377097.1 ENSE00001376166.1 ***ENSE00001054696.4**** NEW TRANSCRIPT ENSG00000160818.4 ENST00000334588.2 ENSE00001434292.1 ENSE00001054698.3 ***ENSE00001054700.1*** ENSE00001054701.1 ENSE00001054703.1 ENSE00001336475.3 end of gene group Any suggestions on where I am going wrong? Scott _______________________________________________ Tutor maillist - [EMAIL PROTECTED] http://mail.python.org/mailman/listinfo/tutor