[Tutor] variation of Unique items question
Hello once more. I am stuck on how best to tie the finding Unique Items in Lists ideas to my file I am stuck at level below: What I have here taken from the unique items thread does not work as I need to separate each grouping to the hg chain it is in (see below for examples) import sys WFILE=open(sys.argv[1], 'w') def get_list_dup_dict(fname='Z:/datasets/fooyoo.txt', threshold=2): a_list=open(fname, 'r') #print beginning get_list_dup items_dict, dup_dict = {}, {} for i in a_list: items_dict[i] = items_dict.get(i, 0) + 1 for k, v in items_dict.iteritems(): if v==threshold: dup_dict[k] = v return dup_dict def print_list_dup_report(fname='Z:/datasets/fooyoo.txt', threshold=2): #print Beginning report generation dup_dict = get_list_dup_dict(fname='Z:/datasets/fooyoo.txt', threshold=2) for k, v in sorted(dup_dict.iteritems()): print WFILE,'%s occurred %s times' %(k, v) if __name__ == '__main__': print_list_dup_report() My issue is that my file is as follows: hg17_chainMm5_chr15 range=chr7:148238502-148239073 ENST0339563.1 ENST0342196.1 ENST0339563.1 ENST0344055.1 hg17_chainMm5_chr13 range=chr5:42927967-42928726 ENST0279800.3 ENST0309556.3 hg17_chainMm5_chr6 range=chr1:155548627-155549517 ENST0321157.3 ENST0256324.4 I need a print out that would give the line hg17 and then any instances of the ENST that occur more than once only for that chain section. Even better it only prints the hg17 line if it is followed by an instance of ENST that occurs more than once I am hoping for something that gives me an out file roughly like: hg17_chainMm5_chr15 range=chr7:148238502-148239073 ENST0339563.1 occurs 2 times hg17_chainMm5_chr13 range=chr5:42927967-42928726 ENST0279800.3 occurs 2 times All help and ideas appreciated, I am trying to get this finished as soon as possible, the output file will be used to go back to my 2 gb file and pull out the rest of the data I need. Thanks, Scott ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
[Tutor] Re: variation of Unique items question
Hello. Kent once again you have responded incredibly quickly in a most helpful manor. I sometimes wonder if the old reference to a Kent-bot has some truth to it. Thanks again, I will play with it and keep on going. Scott ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
[Tutor] sorting a 2 gb file
Hello! I am wondering about the best way to handle sorting some data from some of my results. I have an file in the form shown at the end (please forgive any wrapparounds due to the width of the screen here- the lines starting with ENS end with the e-12 or what have you on same line.) What I would like is to generate an output file of any other ENSE000...e-4 (or whathaveyou) lines that appear in more than one place and for each of those the queries they appear related to. So if the first line ENSE1098330.2|ENSG0013573.6|ENST0350437.2 assembly=N... etc appears as a result in any other query I would like it and the queries it appears as a result to (including the score if possible). My data set the below is taken from is over 2.4 gb so speed and memory considerations come into play. Are sets more effective than lists for this? To save space in the new file I really only need the name of the result up to the | and the score at the end for each. to simplify things, the score could be dropped, and I could check it out as needed later. As always all feedback is very appreciated. Thanks, Scott FILE: This is the number 1 query tested. Results for scoring against Query= hg17_chainMm5_chr17 range=chr1:2040-3330 5'pad=0 3'pad=0 are: ENSE1098330.2|ENSG0013573.6|ENST0350437.2 assembly=N...72 1e-12 ENSE1160046.1|ENSG0013573.6|ENST0251758.3 assembly=N...72 1e-12 ENSE1404464.1|ENSG0013573.6|ENST0228264.4 assembly=N...72 1e-12 ENSE1160046.1|ENSG0013573.6|ENST0290818.3 assembly=N...72 1e-12 ENSE1343865.2|ENSG0013573.6|ENST0350437.2 assembly=N...46 8e-05 ENSE1160049.1|ENSG0013573.6|ENST0251758.3 assembly=N...46 8e-05 ENSE1343865.2|ENSG0013573.6|ENST0228264.4 assembly=N...46 8e-05 ENSE1160049.1|ENSG0013573.6|ENST0290818.3 assembly=N...46 8e-05 This is the number 2 query tested. Results for scoring against Query= hg17_chainMm5_chr1 range=chr1:82719-95929 5'pad=0 3'pad=0 are: ENSE1373792.1|ENSG0175182.4|ENST0310585.3 assembly=N...80 6e-14 ENSE1134144.2|ENSG0160013.2|ENST0307155.2 assembly=N...78 2e-13 ENSE1433065.1|ENSG0185480.2|ENST0358383.1 assembly=N...78 2e-13 ENSE1422761.1|ENSG0183160.2|ENST0360503.1 assembly=N...74 4e-12 ENSE1431410.1|ENSG0139631.6|ENST0308926.3 assembly=N...74 4e-12 ENSE1433065.1|ENSG0185480.2|ENST0358383.1 assembly=N...72 1e-11 ENSE1411753.1|ENSG0126882.4|ENST0358329.1 assembly=N...72 1e-11 ENSE1428167.1|ENSG0110497.4|ENST0314823.4 assembly=N...72 1e-11 ENSE1401130.1|ENSG0160828.5|ENST0359898.1 assembly=N...72 1e-11 ENSE1414900.1|ENSG0176920.4|ENST0356650.1 assembly=N...72 1e-11 ENSE1428167.1|ENSG0110497.4|ENST0314823.4 assembly=N...72 1e-11 ENSE1400942.1|ENSG0138670.5|ENST0356373.1 assembly=N...72 1e-11 ENSE1400116.1|ENSG0120907.6|ENST0356368.1 assembly=N...70 6e-11 ENSE1413546.1|ENSG0184209.6|ENST0344033.2 assembly=N...70 6e-11 ENSE1433572.1|ENSG0124243.5|ENST0355583.1 assembly=N...70 6e-11 ENSE1423154.1|ENSG0125875.4|ENST0354200.1 assembly=N...70 6e-11 ENSE1400109.1|ENSG0183785.3|ENST0339190.2 assembly=N...70 6e-11 ENSE1268950.4|ENSG0084112.4|ENST0303438.2 assembly=N...68 2e-10 ENSE1057279.1|ENSG0161270.6|ENST0292886.2 assembly=N...68 2e-10 ENSE1434317.1|ENSG0171453.2|ENST0304004.2 assembly=N...68 2e-10 ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
RE: [Tutor] sorting a 2 gb file- i shrunk it and turned it around
Thanks for the thoughts so far. After posting I have been thinking about how to pare down the file (much of the info in the big file was not relevant to this question at hand). After the first couple of responses I was even more motivated to shrink the file so not have to set up a db. This test will be run only now and later to verify with another test set so the db set up seemed liked more work than might be worth it. I was able to reduce my file down about 160 mb in size by paring out every line not directly related to what I want by some simple regular expressions and a couple tests for inclusion. The format and what info is compared against what is different from my original examples as I believe this is more clear. my queries are named by the lines such as: ENSE1387275.1|ENSG0187908.1|ENST0339871.1 ENSE is an exon ENSG is the gene ENST is a transcript They all have the above format, they differ in in numbers above following ENS[E,G orT]. Each query is for a different exon. For background each gene has many exons and there are different versions of which exons are in each gene in this dataset. These different collections are the transcripts ie ENST0339871.1 in short a transcript is a version of a gene here transcript 1 may be formed of exons a,b and c transcript 2 may contain exons a,b,d the other lines (results) are of the format hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...44 0.001 hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...44 0.001 hg17_chainMm5_chr7_random range=chr10:124355404-124355687 is the important part here from 5'pad on is not important at this point What I am trying to do is now make a list of any of the results that appear in more than one transcript ## FILE SAMPLE: This is the number 1 query tested. Results for scoring against Query= ENSE1387275.1|ENSG0187908.1|ENST0339871.1 are: hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...44 0.001 hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...44 0.001 hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...44 0.001 hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...44 0.001 hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...44 0.001 hg17_chainMm5_chr7_random range=chr10:124355388-124355719 5'pad=...44 0.001 This is the number 3 query tested. Results for scoring against Query= ENSE1365999.1|ENSG0187908.1|ENST0339871.1 are: hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...60 2e-08 hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...60 2e-08 hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...60 2e-08 hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...60 2e-08 ## I would like to generate a file that looks for any results (the hg17_etc line) that occur in more than transcript (from the query line ENSE1365999.1|ENSG0187908.1|ENST0339871.1) so if hg17_chainMm5_chr7_random range=chr10:124355404-124355687 shows up again later in the file I want to know and want to record where it is used more than once, otherwise I will ignore it. I am think another reg expression to capture the transcript id followed by something that captures each of the results, and writes to another file anytime a result appears more than once, and ties the transcript ids to them somehow. Any suggestions? I agree if I had more time and was going to be doing more of this the DB is the way to go. -As an aside I have not looked into sqlite, I am hoping to avoid the db right now, I'd have to get the sys admin to give me permission to install something again etc etc. Where as I am hoping to get this together in a reasonably short script. However I will look at it later (it could be helpful for other things for me. Thanks again to all, Scott ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
[Tutor] sorting and editing large data files
Hello! I recently suffered a loss of programming files (and I had been putting off my backups...) The project I am working on involved the use of programs which I had been assisted with by some of the group. I am rewritten the programs however my results are not what I expected. I would appreciate anyone who could follow through the steps and programs below and tell me if it seems correct, which it does to me. It could be the error is based on a flaw that I had made before. I will give very simplified explanations of the data to help. Starting a file with genetic data about 95 mb A sample of the file format is: ENSE1384652.1|ENSG0166157.5|ENST0359693.1 assembly=NCBI35|chr=21|strand=reverse|bases 10012801 to 10012624|exons plus upstream and downstream regions for exon GCGGCCGTTCAAGGCAGCCGTCTCCGAGCGGCCCAAGGGAGGGCACAACAGCTGCTACCTGAACAGTTTCTGACCCAACAGTTACCCAGCGCCGGACTCGCTGCGGGCGGCTCTAGGGACGGCGCCTACACTTAGCTCCGCGCCCGAGGTGAGCCCAG It is hard to tell due to email formating however from the to the words regions for exon is one line, all the GCATs are on one line. ENSExxx is an exon id tag followed by a ENSGxgene id tag then a ENSTxxx transcript id tag followed by information about the location of exon. the letters GCTA are the actually exon sequence. 1 gene can have different versions, these are called transcripts and each transcript contains a number of exons In order to visually understand the data better I made a script to organize it into sections with Gene ID, then Trancript ID followed by the different Exon IDs like so: NEW GENE ENSG0166157.5 ENST0359693.1 ENSE1384652.1 ENSE1365174.1 ENSE1372282.1 ENSE1369818.1 ENSE1371166.1 ENSE1369770.1 ENSE1122873.3 ENSE1156126.1 ENSE1156118.1 ENSE1378322.1 ENSE1156105.1 ENSE1156099.1 ENSE1156092.1 ENSE1156083.1 ENSE1156075.1 ENSE1156069.1 ENSE1156063.1 ENSE1156057.1 ENSE1100323.1 ENSE1156046.1 ENSE1126180.1 ENSE1100365.1 ENSE1156031.1 ENSE1231719.5 NEW TRANSCRIPT ENSG0166157.5 ENST0298232.4 ENSE1384652.1 ENSE1365174.1 ENSE1372282.1 ENSE1369818.1 ENSE1371166.1 ENSE1369770.1 ENSE1156126.1 ENSE1156118.1 ENSE1378322.1 ENSE1156105.1 ENSE1156099.1 ENSE1156092.1 ENSE1156083.1 ENSE1156075.1 ENSE1156069.1 ENSE1156063.1 ENSE1156057.1 ENSE1100323.1 ENSE1156046.1 ENSE1126180.1 ENSE1100365.1 ENSE1156031.1 ENSE1231719.5 NEW TRANSCRIPT ENSG0166157.5 ENST0342420.2 ENSE1384652.1 ENSE1365174.1 ENSE1372282.1 ENSE1369818.1 ENSE1371166.1 ENSE1369770.1 ENSE1156118.1 ENSE1378322.1 ENSE1156105.1 ENSE1156099.1 ENSE1156092.1 ENSE1156083.1 ENSE1156075.1 ENSE1156069.1 ENSE1156063.1 ENSE1156057.1 ENSE1100323.1 ENSE1156046.1 ENSE1126180.1 ENSE1100365.1 ENSE1156031.1 ENSE1231719.5 NEW TRANSCRIPT ENSG0166157.5 ENST0339704.2 ENSE1384652.1 ENSE1364671.1 ENSE1387876.1 ENSE1384389.1 ENSE1156111.1 ENSE1156105.1 ENSE1156099.1 ENSE1156092.1 ENSE1156083.1 ENSE1156075.1 ENSE1156069.1 ENSE1156063.1 ENSE1156057.1 ENSE1100323.1 ENSE1156046.1 ENSE1126180.1 ENSE1100365.1 ENSE1156031.1 ENSE1231719.5 end of gene group NEW GENE ENSG0187172.4 ENST0335369.2 ENSE1428001.1 ENSE1331994.3 ENSE1398768.1 ENSE1429773.1 etc. This was accomplished with the following program I called OrganizeData.py #3 #execute with python datafile outputfile import re,sys,string #regular expression to pull out gene, transcript and exon ids info=re.compile('^(ENSE\d+\.\d).+(ENSG\d+\.\d).+(ENST\d+\.\d)') # sample from the file #ENSE1373825.1|ENSG0187908.1|ENST0344467.1 file = open( sys.argv[1], 'r' ) #file to read from WriteFile=open(sys.argv[2], 'w')#file to write to #initialize some variables GENEID is the name of the gene, TRANSID is name of the transcript etc sOldGene= sGENEID= sOldTrans= sTRANSID= sEXONID= for line in file: if info.match(line): Matched= info.match(line) sEXONID=Matched.group(1) sGENEID=Matched.group(2) sTRANSID=Matched.group(3) if sOldGene==sGENEID: #if current line is the same gene as last line if sOldTrans == sTRANSID: #if current line is same transcript as last line print WriteFile, sEXONID,#add new exon to the current transcript line else: print WriteFile,\n\n\tNEW TRANSCRIPT\n, sGENEID, sTRANSID, sEXONID, #new transcript of same gene print 2