Re: [Tutor] sorting a 2 gb file- i shrunk it and turned it around
My guess is that your file is small enough that Danny's two-pass approach will work. You might even be able to do it in one pass. If you have enough RAM, here is a sketch of a one-pass solution:

# This will map each result to a list of the queries that contain that result
results = {}

# Iterate the file, building results
for line in file:
    if isQueryLine(line):
        current_query = line
    elif isResultLine(line):
        key = getResultKey(line)
        results.setdefault(key, []).append(current_query)  # see explanation below

# Now go through results looking for entries with more than one query
for key, queries in results.iteritems():
    if len(queries) > 1:
        print
        print key
        for query in queries:
            print query

You have to fill in the details of isQueryLine(), isResultLine() and getResultKey(); from your earlier posts I'm guessing you can figure them out.

What is this?

results.setdefault(key, []).append(current_query)

setdefault() is a handy method of a dict. It looks up the value corresponding to the given key. If there is no value, it *sets* the value to the given default, and returns that. So after

results.setdefault(key, [])

results[key] will have a list in it, and we will have a reference to that list. Appending to that list then adds the query to the list stored in the dict.
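A quick interpreter session showing that behavior (a minimal sketch; the key and values are made up):

###
>>> d = {}
>>> d.setdefault('chr7', []).append('query1')   # no 'chr7' key yet: an empty list is created, then appended
>>> d.setdefault('chr7', []).append('query2')   # 'chr7' exists now: the same list is returned and appended
>>> d
{'chr7': ['query1', 'query2']}
###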
Please let us know what solution you end up using, and how much memory it needs. I'm interested...

Kent

Scott Melnyk wrote:
[... original message snipped; it appears in full later in this thread ...]
[Tutor] sorting a 2 gb file
Hello!

I am wondering about the best way to handle sorting some data from some of my results. I have a file in the form shown at the end (please forgive any wraparounds due to the width of the screen here - the lines starting with ENS end with the e-12, or what have you, on the same line).

What I would like is to generate an output file of any ENSE000...e-4 (or what have you) lines that appear in more than one place, and for each of those, the queries they appear related to. So if the first line, ENSE1098330.2|ENSG0013573.6|ENST0350437.2 assembly=N... etc., appears as a result in any other query, I would like it and the queries it appears as a result to (including the score if possible).

The data set the sample below is taken from is over 2.4 gb, so speed and memory considerations come into play. Are sets more effective than lists for this?

To save space in the new file I really only need the name of the result up to the | and the score at the end for each. To simplify things, the score could be dropped, and I could check it out as needed later.

As always, all feedback is very appreciated.

Thanks,
Scott

FILE:

This is the number 1 query tested.
Results for scoring against Query= hg17_chainMm5_chr17 range=chr1:2040-3330 5'pad=0 3'pad=0 are:

ENSE1098330.2|ENSG0013573.6|ENST0350437.2 assembly=N...72 1e-12
ENSE1160046.1|ENSG0013573.6|ENST0251758.3 assembly=N...72 1e-12
ENSE1404464.1|ENSG0013573.6|ENST0228264.4 assembly=N...72 1e-12
ENSE1160046.1|ENSG0013573.6|ENST0290818.3 assembly=N...72 1e-12
ENSE1343865.2|ENSG0013573.6|ENST0350437.2 assembly=N...46 8e-05
ENSE1160049.1|ENSG0013573.6|ENST0251758.3 assembly=N...46 8e-05
ENSE1343865.2|ENSG0013573.6|ENST0228264.4 assembly=N...46 8e-05
ENSE1160049.1|ENSG0013573.6|ENST0290818.3 assembly=N...46 8e-05

This is the number 2 query tested.
Results for scoring against Query= hg17_chainMm5_chr1 range=chr1:82719-95929 5'pad=0 3'pad=0 are:

ENSE1373792.1|ENSG0175182.4|ENST0310585.3 assembly=N...80 6e-14
ENSE1134144.2|ENSG0160013.2|ENST0307155.2 assembly=N...78 2e-13
ENSE1433065.1|ENSG0185480.2|ENST0358383.1 assembly=N...78 2e-13
ENSE1422761.1|ENSG0183160.2|ENST0360503.1 assembly=N...74 4e-12
ENSE1431410.1|ENSG0139631.6|ENST0308926.3 assembly=N...74 4e-12
ENSE1433065.1|ENSG0185480.2|ENST0358383.1 assembly=N...72 1e-11
ENSE1411753.1|ENSG0126882.4|ENST0358329.1 assembly=N...72 1e-11
ENSE1428167.1|ENSG0110497.4|ENST0314823.4 assembly=N...72 1e-11
ENSE1401130.1|ENSG0160828.5|ENST0359898.1 assembly=N...72 1e-11
ENSE1414900.1|ENSG0176920.4|ENST0356650.1 assembly=N...72 1e-11
ENSE1428167.1|ENSG0110497.4|ENST0314823.4 assembly=N...72 1e-11
ENSE1400942.1|ENSG0138670.5|ENST0356373.1 assembly=N...72 1e-11
ENSE1400116.1|ENSG0120907.6|ENST0356368.1 assembly=N...70 6e-11
ENSE1413546.1|ENSG0184209.6|ENST0344033.2 assembly=N...70 6e-11
ENSE1433572.1|ENSG0124243.5|ENST0355583.1 assembly=N...70 6e-11
ENSE1423154.1|ENSG0125875.4|ENST0354200.1 assembly=N...70 6e-11
ENSE1400109.1|ENSG0183785.3|ENST0339190.2 assembly=N...70 6e-11
ENSE1268950.4|ENSG0084112.4|ENST0303438.2 assembly=N...68 2e-10
ENSE1057279.1|ENSG0161270.6|ENST0292886.2 assembly=N...68 2e-10
ENSE1434317.1|ENSG0171453.2|ENST0304004.2 assembly=N...68 2e-10

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
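On that last point, pulling the name up to the | and the trailing score out of a result line needs only plain string methods. A minimal sketch (the sample line is copied from the data above):

###
# the name is everything before the first '|'; the score is the last
# whitespace-separated field on the line
line = 'ENSE1098330.2|ENSG0013573.6|ENST0350437.2 assembly=N...72 1e-12'
name = line.split('|', 1)[0]        # 'ENSE1098330.2'
score = line.rsplit(None, 1)[-1]    # '1e-12'
###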
RE: [Tutor] sorting a 2 gb file
I'll just Me Too on Alan's advice. I had a similar sized project, only it was binary data in an ISAM file instead of flat ASCII. I tried several pure Python methods and all took forever. Finally I used Python to read, modify, and load the source data into a MySQL database. Then I pulled the data out via Python and wrote it to a new ISAM file. The whole thing took longer to code that way, but boy, it sure scaled MUCH better and was much quicker in the end.

John Purser

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Alan Gauld
Sent: Tuesday, January 25, 2005 05:09
To: Scott Melnyk; tutor@python.org
Subject: Re: [Tutor] sorting a 2 gb file

My data set the below is taken from is over 2.4 gb so speed and memory considerations come into play.

To be honest, if this were my problem, I'd probably dump all the data into a database and use SQL to extract what I needed. That's a much more effective tool for this kind of thing.

You can do it with Python, but I think we need more understanding of the problem. For example, what the various fields represent, and how much of a comparison (ie which fields, case sensitivity etc) leads to equality.

Alan G.

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] sorting a 2 gb file
Alan Gauld wrote:

My data set the below is taken from is over 2.4 gb so speed and memory considerations come into play.

To be honest, if this were my problem, I'd probably dump all the data into a database and use SQL to extract what I needed. That's a much more effective tool for this kind of thing. You can do it with Python, but I think we need more understanding of the problem. For example, what the various fields represent, and how much of a comparison (ie which fields, case sensitivity etc) leads to equality.

And if the idea of setting up a full-blown SQL server for the problem seems like a lot of work, you might try prototyping the sort and solutions with sqlite, and only migrate to (full-fledged RDBMS of your choice) if the prototype works as you want it to and sqlite seems too slow for your needs.

Andy

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
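A minimal sketch of the sqlite prototype Andy suggests, against the file format in Scott's first post. It assumes the sqlite3 module bundled with Python 2.5 and later (older installs used the separate pysqlite package); the helper definitions and filenames are guesses from the sample data, not Scott's actual code:

###
import sqlite3

def isQueryLine(line):
    # query headers look like "... Query= hg17_chainMm5_chr17 range=... are:"
    return 'Query=' in line

def isResultLine(line):
    # result lines look like "ENSE1098330.2|ENSG0013573.6|... 72 1e-12"
    return line.startswith('ENSE')

def getResultKey(line):
    # the name up to the first '|' is enough to identify a result
    return line.split('|', 1)[0]

conn = sqlite3.connect('results.db')
cur = conn.cursor()
cur.execute('CREATE TABLE hits (result_key TEXT, query TEXT)')

current_query = None
for line in open('results.txt'):          # hypothetical input filename
    if isQueryLine(line):
        current_query = line.strip()
    elif isResultLine(line) and current_query is not None:
        cur.execute('INSERT INTO hits VALUES (?, ?)',
                    (getResultKey(line), current_query))
conn.commit()

# let SQL find the result keys that appear under more than one query
cur.execute('''SELECT result_key, COUNT(DISTINCT query) FROM hits
               GROUP BY result_key HAVING COUNT(DISTINCT query) > 1''')
for key, n in cur.fetchall():
    print key, n
###

If the prototype is fast enough, you never need the full server; if not, the same load-then-query steps port almost unchanged to MySQL or PostgreSQL.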
RE: [Tutor] sorting a 2 gb file- i shrunk it and turned it around
Thanks for the thoughts so far. After posting I have been thinking about how to pare down the file (much of the info in the big file was not relevant to the question at hand). After the first couple of responses I was even more motivated to shrink the file so as not to have to set up a db. This test will be run only now and later to verify with another test set, so the db setup seemed like more work than it might be worth.

I was able to reduce my file down to about 160 mb in size by paring out every line not directly related to what I want, using some simple regular expressions and a couple of tests for inclusion. The format, and what info is compared against what, is different from my original examples, as I believe this is more clear.

My queries are named by lines such as:

ENSE1387275.1|ENSG0187908.1|ENST0339871.1

ENSE is an exon, ENSG is the gene, ENST is a transcript.

They all have the above format; they differ in the numbers following ENS[E, G or T]. Each query is for a different exon. For background: each gene has many exons, and there are different versions of which exons are in each gene in this dataset. These different collections are the transcripts, ie ENST0339871.1. In short, a transcript is a version of a gene. Here transcript 1 may be formed of exons a, b and c; transcript 2 may contain exons a, b, d.

The other lines (results) are of the format:

hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...44 0.001
hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pa...44 0.001

"hg17_chainMm5_chr7_random range=chr10:124355404-124355687" is the important part here; from 5'pad on is not important at this point.

What I am trying to do now is make a list of any of the results that appear in more than one transcript.

## FILE SAMPLE:

This is the number 1 query tested.
Results for scoring against Query= ENSE1387275.1|ENSG0187908.1|ENST0339871.1 are:

hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...44 0.001
hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pa...44 0.001
hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...44 0.001
hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...44 0.001
hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...44 0.001
hg17_chainMm5_chr7_random range=chr10:124355388-124355719 5'pad=...44 0.001

This is the number 3 query tested.
Results for scoring against Query= ENSE1365999.1|ENSG0187908.1|ENST0339871.1 are:

hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pa...60 2e-08
hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad...60 2e-08
hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad...60 2e-08
hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad...60 2e-08

##

I would like to generate a file that looks for any results (the hg17_ etc. line) that occur in more than one transcript (from the query line, ENSE1365999.1|ENSG0187908.1|ENST0339871.1). So if hg17_chainMm5_chr7_random range=chr10:124355404-124355687 shows up again later in the file, I want to know, and I want to record where it is used more than once; otherwise I will ignore it.

I am thinking another regular expression to capture the transcript id, followed by something that captures each of the results and writes to another file any time a result appears more than once, tying the transcript ids to them somehow. Any suggestions? I agree that if I had more time and was going to be doing more of this, the DB is the way to go.
As an aside, I have not looked into sqlite; I am hoping to avoid the db right now, as I'd have to get the sys admin to give me permission to install something again, etc etc. Whereas I am hoping to get this together in a reasonably short script. However, I will look at it later (it could be helpful for other things for me).

Thanks again to all,
Scott

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
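A minimal sketch of the regular-expression approach Scott describes, written against the pared-down format above. The patterns, filenames and output format are assumptions, not Scott's actual code:

###
import re

# the transcript id is the third |-separated field of a query line
query_re = re.compile(r'Query=\s*ENSE\S+\|ENSG\S+\|(ENST\S+)')
# the name plus the range= field is the stable, important part of a result line
result_re = re.compile(r'^(hg17_\S+\s+range=\S+)')

transcripts_by_result = {}
current_transcript = None
for line in open('pared_results.txt'):    # hypothetical filename
    m = query_re.search(line)
    if m:
        current_transcript = m.group(1)
        continue
    m = result_re.match(line)
    if m and current_transcript is not None:
        transcripts_by_result.setdefault(m.group(1), set()).add(current_transcript)

# record only the results that appeared in more than one transcript,
# tied to the transcript ids that share them
out = open('shared_results.txt', 'w')
for result, transcripts in transcripts_by_result.iteritems():
    if len(transcripts) > 1:
        out.write('%s\t%s\n' % (result, ', '.join(sorted(transcripts))))
out.close()
###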
Re: [Tutor] sorting a 2 gb file
On Tue, 25 Jan 2005, Scott Melnyk wrote:

I have a file in the form shown at the end (please forgive any wraparounds due to the width of the screen here - the lines starting with ENS end with the e-12, or what have you, on the same line). What I would like is to generate an output file of any ENSE000...e-4 (or what have you) lines that appear in more than one place, and for each of those, the queries they appear related to.

Hi Scott,

One way to do this might be to do it in two passes across the file. The first pass through the file can identify records that appear more than once. The second pass can take that knowledge, and then display those records. In pseudocode, this will look something like:

###
hints = identifyDuplicateRecords(filename)
displayDuplicateRecords(filename, hints)
###

My data set the below is taken from is over 2.4 gb so speed and memory considerations come into play. Are sets more effective than lists for this?

Sets or dictionaries make the act of looking up a key fairly cheap. In the two-pass approach, the first pass can use a dictionary to accumulate the number of times a certain record's key has occurred. Note that, because your file is so large, the dictionary probably shouldn't accumulate the whole mass of information that we've seen so far: instead, it's sufficient to record the information we need to recognize a duplicate.

If you have more questions, please feel free to ask!

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
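Filling in Danny's pseudocode, a minimal counting-dict sketch; extractKey() is a hypothetical helper that pulls out whatever identifies a record (for Scott's data, perhaps the name up to the first '|'):

###
def extractKey(line):
    # hypothetical: the result name up to the first '|' identifies a record
    return line.split('|', 1)[0]

def identifyDuplicateRecords(filename):
    # first pass: count how many times each key occurs
    counts = {}
    for line in open(filename):
        key = extractKey(line)
        counts[key] = counts.get(key, 0) + 1
    return set(key for key, n in counts.iteritems() if n > 1)

def displayDuplicateRecords(filename, hints):
    # second pass: show only the lines whose key occurred more than once
    for line in open(filename):
        if extractKey(line) in hints:
            print line,   # trailing comma: the line already ends in '\n'
###

Only the keys and their counts are held in memory, which is what keeps the first pass from swallowing the whole file.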
Re: [Tutor] sorting a 2 gb file
On Jan 25, 2005, at 23:40, Danny Yoo wrote:

In pseudocode, this will look something like:

###
hints = identifyDuplicateRecords(filename)
displayDuplicateRecords(filename, hints)
###

My data set the below is taken from is over 2.4 gb so speed and memory considerations come into play. Are sets more effective than lists for this?

Sets or dictionaries make the act of looking up a key fairly cheap. In the two-pass approach, the first pass can use a dictionary to accumulate the number of times a certain record's key has occurred. Note that, because your file is so large, the dictionary probably shouldn't accumulate the whole mass of information that we've seen so far: instead, it's sufficient to record the information we need to recognize a duplicate.

However, the first pass will consume a lot of memory. Considering the worst-case scenario where each record only appears once, you'll find yourself with the whole 2 GB file loaded into memory. (Or do you have a smarter way to do this?)

-- Max
maxnoel_fr at yahoo dot fr -- ICQ #85274019
"Look at you, hacker... A pathetic creature of meat and bone, panting and sweating as you run through my corridors... How can you challenge a perfect, immortal machine?"

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] sorting a 2 gb file
On Tue, 25 Jan 2005, Max Noel wrote:

My data set the below is taken from is over 2.4 gb so speed and memory considerations come into play. Are sets more effective than lists for this?

Sets or dictionaries make the act of looking up a key fairly cheap. In the two-pass approach, the first pass can use a dictionary to accumulate the number of times a certain record's key has occurred. Note that, because your file is so large, the dictionary probably shouldn't accumulate the whole mass of information that we've seen so far: instead, it's sufficient to record the information we need to recognize a duplicate.

However, the first pass will consume a lot of memory. Considering the worst-case scenario where each record only appears once, you'll find yourself with the whole 2 GB file loaded into memory. (Or do you have a smarter way to do this?)

Hi Max,

My assumptions are that each record consists of some identifying string key that's associated to some value. How are we deciding that two records are talking about the same thing? I'm hoping that the set of unique keys isn't itself very large. Under this assumption, we can do something like this:

###
from sets import Set

def firstPass(f):
    """Returns a set of the duplicate keys in f."""
    seenKeys = Set()
    duplicateKeys = Set()
    for record in f:
        key = extractKey(record)
        if key in seenKeys:
            duplicateKeys.add(key)
        else:
            seenKeys.add(key)
    return duplicateKeys
###

where we don't store the whole record in memory, but only the 'key' portion of the record. And if the number of unique keys is small enough, this should be enough to recognize duplicate records. So on the second pass through, we can display the duplicate records on the fly.

If this assumption is not true, then we need to do something else. *grin* One possibility might be to implement an external sorting mechanism:

http://www.nist.gov/dads/HTML/externalsort.html

But if we're willing to do an external sort, then we're already doing enough work that we should really consider using a DBMS. The more complicated the data management becomes, the more attractive it becomes to use a real database to handle these data management issues. We're trying to solve a problem that is already solved by a real database management system.

Talk to you later!

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
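For the curious, a minimal sketch of that external-sort idea: sort the file in memory-sized chunks, spill each sorted run to a temporary file, then merge the runs. Note that heapq.merge appeared in Python 2.6; on older Pythons you would write the k-way merge by hand. Once the output is sorted, duplicate keys sit on adjacent lines and fall out of a single streaming pass:

###
import heapq
import tempfile

def spill(lines):
    # sort one chunk in memory and write it to an anonymous temp file
    lines.sort()
    run = tempfile.TemporaryFile()
    run.writelines(lines)
    run.seek(0)
    return run

def external_sort(infile, outfile, chunk_lines=1000000):
    # pass 1: cut the input into sorted runs that each fit in RAM
    runs, lines = [], []
    for line in open(infile):
        lines.append(line)
        if len(lines) >= chunk_lines:
            runs.append(spill(lines))
            lines = []
    if lines:
        runs.append(spill(lines))

    # pass 2: k-way merge of the sorted runs (heapq.merge is Python 2.6+)
    out = open(outfile, 'w')
    for line in heapq.merge(*runs):
        out.write(line)
    out.close()
###

But as Danny says, by the time you are doing this you may as well let a database do the bookkeeping.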