Re: [Tutor] sorting a 2 gb file- i shrunk it and turned it around

2005-01-26 Thread Kent Johnson
My guess is that your file is small enough that Danny's two-pass approach will work. You might even 
be able to do it in one pass.

If you have enough RAM, here is a sketch of a one-pass solution:
# This will map each result to a list of queries that contain that result
results = {}
# Iterate the file building results
for line in file:
    if isQueryLine(line):
        current_query = line
    elif isResultLine(line):
        key = getResultKey(line)
        results.setdefault(key, []).append(current_query) # see explanation below
# Now go through results looking for entries with more than one query
for key, queries in results.iteritems():
    if len(queries) > 1:
        print
        print key
        for query in queries:
            print query
You have to fill in the details of isQueryLine(), isResultLine() and getResultKey(); from your 
earlier posts I'm guessing you can figure them out.
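For example (a rough sketch only; the patterns are guesses based on the sample
lines in your post, so adjust them to the real file):

import re

# Assumed patterns, based on the sample lines posted earlier
query_re = re.compile(r'^ENSE\S+\|ENSG\S+\|ENST\S+')
result_re = re.compile(r'^(hg17_\S+ range=\S+)')

def isQueryLine(line):
    return query_re.match(line) is not None

def isResultLine(line):
    return result_re.match(line) is not None

def getResultKey(line):
    # Keep everything up to and including the range= field; drop 5'pad onwards.
    # Only called after isResultLine(), so the match is known to succeed.
    return result_re.match(line).group(1)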

What is this:
  results.setdefault(key, []).append(current_query)
setdefault() is a handy method of a dict. It looks up the value corresponding to the given key. If 
there is no value, it *sets* the value to the given default, and returns that. So after
  results.setdefault(key, [])
results[key] will have a list in it, and we will have a reference to the list.

Then appending to that list adds the query to the list stored in the dict.
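A quick interactive example (with made-up keys and query names) shows the effect:

>>> results = {}
>>> results.setdefault('chr7:100-200', []).append('query_A')
>>> results.setdefault('chr7:100-200', []).append('query_B')
>>> results
{'chr7:100-200': ['query_A', 'query_B']}

The first setdefault() creates the empty list; the second finds the existing
list and returns it, so both appends end up in the same list.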
Please let us know what solution you end up using, and how much memory it 
needs. I'm interested...
Kent
Scott Melnyk wrote:
Thanks for the thoughts so far.  After posting I have been thinking
about how to pare down the file (much of the info in the big file was
not relevant to this question at hand).
After the first couple of responses I was even more motivated to
shrink the file so as not to have to set up a db. This test will be run
only now, and later to verify with another test set, so the db setup
seemed like more work than it might be worth.
I was able to reduce my file down to about 160 MB in size by paring out
every line not directly related to what I want, using some simple
regular expressions and a couple of tests for inclusion.
The format, and which info is compared against which, is different from
my original examples; I believe this version is clearer.
My queries are named by lines such as:
ENSE1387275.1|ENSG0187908.1|ENST0339871.1
ENSE is an exon, ENSG is the gene, ENST is a transcript.
They all have the above format; they differ only in the numbers
following ENS[E, G or T].
Each query is for a different exon.  For background, each gene has many
exons, and in this dataset there are different versions of which exons
are in each gene.  These different collections are the transcripts, i.e.
ENST0339871.1.
In short, a transcript is a version of a gene here:
transcript 1 may be formed of exons a, b and c
transcript 2 may contain exons a, b and d


The other lines (results) are of the format:
hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...44  0.001
hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...44  0.001
hg17_chainMm5_chr7_random range=chr10:124355404-124355687 is the
important part here; everything from 5'pad on is not important at this point.
What I am trying to do now is make a list of any of the results that
appear in more than one transcript.
##
FILE SAMPLE:
This is the number 1  query tested.
Results for scoring against Query=
ENSE1387275.1|ENSG0187908.1|ENST0339871.1
 are: 

hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...44  0.001
hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...44  0.001
hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr7_random range=chr10:124355388-124355719 5'pad=...44  0.001

This is the number 3  query tested.
Results for scoring against Query=
ENSE1365999.1|ENSG0187908.1|ENST0339871.1
 are: 

hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...60  2e-08
hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...60  2e-08
hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...60  2e-08
hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...60  2e-08
##
I would like to generate a file that looks for any results (the
hg17_ etc. lines) that occur in more than one transcript (from the query
line ENSE1365999.1|ENSG0187908.1|ENST0339871.1).
So if
hg17_chainMm5_chr7_random range=chr10:124355404-124355687
shows up again later in the file I want to know, and I want to record
where it is used more than once; otherwise I will ignore it.

I am thinking of another regular expression to capture the transcript
id, followed by something that captures each of the results, writes to
another file any time a result appears more than once, and ties the
transcript ids to them somehow.
Any suggestions?
I agree that if I had more time and was going to be doing more of this,
the DB is the way to go.

RE: [Tutor] sorting a 2 gb file

2005-01-25 Thread John Purser
I'll just "me too" on Alan's advice.  I had a similar-sized project, only it
was binary data in an ISAM file instead of flat ASCII.  I tried several
pure Python methods and all took forever.  Finally I used Python to
read, modify, and load the source data into a MySQL database.  Then I pulled
the data out via Python and wrote it to a new ISAM file.  The whole thing took
longer to code that way, but boy, it sure scaled MUCH better and was much
quicker in the end.

John Purser

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
Of Alan Gauld
Sent: Tuesday, January 25, 2005 05:09
To: Scott Melnyk; tutor@python.org
Subject: Re: [Tutor] sorting a 2 gb file

 My data set, which the below is taken from, is over 2.4 GB, so speed
 and memory considerations come into play.

To be honest, if this were my problem, I'd probably dump all the data
into a database and use SQL to extract what I needed. That's a much
more effective tool for this kind of thing.

You can do it with Python, but I think we need more understanding
of the problem. For example, what the various fields represent, and how
much of a comparison (i.e. which fields, case sensitivity, etc.) leads
to equality.

Alan G.

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] sorting a 2 gb file

2005-01-25 Thread Andrew D. Fant
Alan Gauld wrote:
My data set, which the below is taken from, is over 2.4 GB, so speed
and memory considerations come into play.

To be honest, if this were my problem, I'd probably dump all the data
into a database and use SQL to extract what I needed. That's a much
more effective tool for this kind of thing.
You can do it with Python, but I think we need more understanding
of the problem. For example, what the various fields represent, and how
much of a comparison (i.e. which fields, case sensitivity, etc.) leads
to equality.

And if the idea of setting up a full-blown SQL server for the problem
seems like a lot of work, you might try prototyping the sort and the
solution with sqlite, and only migrate to (full-fledged RDBMS of your
choice) if the prototype works as you want it to and sqlite seems too
slow for your needs.
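For what it's worth, here is a rough sketch of what such a prototype might
look like (table and column names are just placeholders; the sqlite3 name
below stands for whatever DB-API sqlite binding you have, e.g. pysqlite):

import sqlite3   # or: from pysqlite2 import dbapi2 as sqlite3

conn = sqlite3.connect('hits.db')
cur = conn.cursor()
cur.execute('CREATE TABLE hits (result_key TEXT, transcript TEXT)')

# ... one INSERT per (result, transcript) pair parsed out of the big file ...
cur.execute('INSERT INTO hits VALUES (?, ?)',
            ('hg17_chainMm5_chr7 range=chr10:124355391-124355690', 'ENST0339871.1'))
conn.commit()

# Result keys that occur in more than one distinct transcript
cur.execute('''SELECT result_key, COUNT(DISTINCT transcript)
               FROM hits
               GROUP BY result_key
               HAVING COUNT(DISTINCT transcript) > 1''')
for row in cur.fetchall():
    print row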

Andy
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


RE: [Tutor] sorting a 2 gb file- i shrunk it and turned it around

2005-01-25 Thread Scott Melnyk
Thanks for the thoughts so far.  After posting I have been thinking
about how to pare down the file (much of the info in the big file was
not relevant to this question at hand).

After the first couple of responses I was even more motivated to
shrink the file so as not to have to set up a db. This test will be run
only now, and later to verify with another test set, so the db setup
seemed like more work than it might be worth.

I was able to reduce my file down to about 160 MB in size by paring out
every line not directly related to what I want, using some simple
regular expressions and a couple of tests for inclusion.

The format, and which info is compared against which, is different from
my original examples; I believe this version is clearer.


My queries are named by lines such as:
ENSE1387275.1|ENSG0187908.1|ENST0339871.1
ENSE is an exon, ENSG is the gene, ENST is a transcript.

They all have the above format; they differ only in the numbers
following ENS[E, G or T].

Each query is for a different exon.  For background, each gene has many
exons, and in this dataset there are different versions of which exons
are in each gene.  These different collections are the transcripts, i.e.
ENST0339871.1.

In short, a transcript is a version of a gene here:
transcript 1 may be formed of exons a, b and c
transcript 2 may contain exons a, b and d



The other lines (results) are of the format:
hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...44  0.001
hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...44  0.001

hg17_chainMm5_chr7_random range=chr10:124355404-124355687 is the
important part here; everything from 5'pad on is not important at this point.


What I am trying to do now is make a list of any of the results that
appear in more than one transcript.

##
FILE SAMPLE:

This is the number 1  query tested.
Results for scoring against Query=
ENSE1387275.1|ENSG0187908.1|ENST0339871.1
 are: 

hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...44  0.001
hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...44  0.001
hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr7_random range=chr10:124355388-124355719 5'pad=...44  0.001



This is the number 3  query tested.
Results for scoring against Query=
ENSE1365999.1|ENSG0187908.1|ENST0339871.1
 are: 

hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...60  2e-08
hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...60  2e-08
hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...60  2e-08
hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...60  2e-08

##

I would like to generate a file that looks for any results (the
hg17_ etc. lines) that occur in more than one transcript (from the query
line ENSE1365999.1|ENSG0187908.1|ENST0339871.1).


So if
hg17_chainMm5_chr7_random range=chr10:124355404-124355687
shows up again later in the file I want to know, and I want to record
where it is used more than once; otherwise I will ignore it.

I am thinking of another regular expression to capture the transcript
id, followed by something that captures each of the results, writes to
another file any time a result appears more than once, and ties the
transcript ids to them somehow.
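For instance, the transcript-id capture might start out something like this
(the pattern is only a guess based on the sample query lines above):

import re

# Pull the ENST... transcript id out of a query line such as
# ENSE1387275.1|ENSG0187908.1|ENST0339871.1
transcript_re = re.compile(r'\|(ENST[\d.]+)\s*$')

m = transcript_re.search('ENSE1387275.1|ENSG0187908.1|ENST0339871.1')
if m:
    print m.group(1)   # prints ENST0339871.1

A dictionary mapping each result line (trimmed at 5'pad) to the transcript ids
it has been seen with would then give the duplicates directly.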

Any suggestions?
I agree that if I had more time and was going to be doing more of this,
the DB is the way to go.
As an aside, I have not looked into sqlite; I am hoping to avoid the
db right now, as I'd have to get the sys admin to give me permission to
install something again, etc.  Whereas I am hoping to get this
together in a reasonably short script.

However, I will look at it later (it could be helpful for other things for me).

Thanks again to all,  
Scott
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] sorting a 2 gb file

2005-01-25 Thread Danny Yoo


On Tue, 25 Jan 2005, Scott Melnyk wrote:


 I have a file in the form shown at the end (please forgive any
 wraparounds due to the width of the screen here; the lines starting
 with ENS end with the e-12 or what have you on the same line).

 What I would like is to generate an output file of any other
 ENSE000...e-4 (or what have you) lines that appear in more than one
 place, and for each of those the queries they appear related to.

Hi Scott,

One way to do this might be to do it in two passes across the file.

The first pass through the file can identify records that appear more than
once.  The second pass can take that knowledge, and then display those
records.

In pseudocode, this will look something like:

###
hints = identifyDuplicateRecords(filename)
displayDuplicateRecords(filename, hints)
###



 My data set, which the below is taken from, is over 2.4 GB, so speed and memory
 considerations come into play.

 Are sets more effective than lists for this?

Sets or dictionaries make the act of lookup of a key fairly cheap.  In
the two-pass approach, the first pass can use a dictionary to accumulate
the number of times a certain record's key has occurred.

Note that, because your file is so large, the dictionary probably
shouldn't accumulate the whole mass of information that we've seen so
far: instead, it's sufficient to record the information we need to
recognize a duplicate.


If you have more questions, please feel free to ask!

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] sorting a 2 gb file

2005-01-25 Thread Max Noel
On Jan 25, 2005, at 23:40, Danny Yoo wrote:
In pseudocode, this will look something like:
###
hints = identifyDuplicateRecords(filename)
displayDuplicateRecords(filename, hints)
###

My data set, which the below is taken from, is over 2.4 GB, so speed and memory
considerations come into play.
Are sets more effective than lists for this?
Sets or dictionaries make the act of lookup of a key fairly cheap.  In
the two-pass approach, the first pass can use a dictionary to accumulate
the number of times a certain record's key has occurred.

Note that, because your file is so large, the dictionary probably
shouldn't accumulate the whole mass of information that we've seen so
far: instead, it's sufficient to record the information we need to
recognize a duplicate.
	However, the first pass will consume a lot of memory. Considering the 
worst-case scenario where each record only appears once, you'll find 
yourself with the whole 2GB file loaded into memory.
	(or do you have a smarter way to do this?)

-- Max
maxnoel_fr at yahoo dot fr -- ICQ #85274019
Look at you hacker... A pathetic creature of meat and bone, panting 
and sweating as you run through my corridors... How can you challenge a 
perfect, immortal machine?

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] sorting a 2 gb file

2005-01-25 Thread Danny Yoo


On Tue, 25 Jan 2005, Max Noel wrote:

  My data set, which the below is taken from, is over 2.4 GB, so speed and
  memory considerations come into play.
 
  Are sets more effective than lists for this?
 
  Sets or dictionaries make the act of lookup of a key fairly cheap.
  In the two-pass approach, the first pass can use a dictionary to
  accumulate the number of times a certain record's key has occurred.
 
  Note that, because your file is so large, the dictionary probably
  shouldn't accumulate the whole mass of information that we've seen
  so far: instead, it's sufficient to record the information we need to
  recognize a duplicate.

   However, the first pass will consume a lot of memory. Considering
 the worst-case scenario where each record only appears once, you'll find
 yourself with the whole 2GB file loaded into memory.
   (or do you have a smarter way to do this?)


Hi Max,

My assumptions are that each record consists of some identifying string
key that's associated with some value.  How are we deciding that two
records are talking about the same thing?


I'm hoping that the set of unique keys isn't itself very large.  Under
this assumption, we can do something like this:

###
from sets import Set
def firstPass(f):
    """Returns a set of the duplicate keys in f."""
    seenKeys = Set()
    duplicateKeys = Set()
    for record in f:
        key = extractKey(record)
        if key in seenKeys:
            duplicateKeys.add(key)
        else:
            seenKeys.add(key)
    return duplicateKeys
###

where we don't store the whole record into memory, but only the 'key'
portion of the record.

And if the number of unique keys is small enough, this should be fine
enough to recognize duplicate records.  So on the second passthrough, we
can display the duplicate records on-the-fly.  If this assumption is not
true, then we need to do something else.  *grin*
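If the assumption does hold, the second pass might be as simple as this
(again assuming the same extractKey() as above, and reading the file a
second time):

###
def secondPass(f, duplicateKeys):
    """Prints every record whose key occurred more than once."""
    for record in f:
        if extractKey(record) in duplicateKeys:
            print record,     # 'record' already ends with a newline

# usage:
# duplicates = firstPass(open(filename))
# secondPass(open(filename), duplicates)
###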

One possibility might be to implement an external sorting mechanism:

http://www.nist.gov/dads/HTML/externalsort.html


But if we're willing to do an external sort, then we're already doing
enough work that we should really consider using a DBMS.  The more
complicated the data management becomes, the more attractive it becomes to
use a real database to handle these data management issues.  We're trying
to solve a problem that is already solved by a real database management
system.


Talk to you later!

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor