Re: [Tutor] sorting a 2 gb file- i shrunk it and turned it around

2005-01-26 Thread Kent Johnson
My guess is that your file is small enough that Danny's two-pass approach will work. You might even 
be able to do it in one pass.

If you have enough RAM, here is a sketch of a one-pass solution:
# This will map each result to a list of queries that contain that result
results= {}
# Iterate the file building results
for line in file:
  if isQueryLine(line):
current_query = line
  elif isResultLine(line):
key = getResultKey(line)
results.setdefault(key, []).append(current_query) # see explanation below
# Now go through results looking for entries with more than one query
for key, queries in results.iteritems():
  if len(queries)  1:
print
print key
for query in queries:
  print query
You have to fill in the details of isQueryLine(), isResultLine() and getResultKey(); from your 
earlier posts I'm guessing you can figure them out.

What is this:
  results.setdefault(key, []).append(current_query)
setdefault() is a handy method of a dict. It looks up the value corresponding to the given key. If 
there is no value, it *sets* the value to the given default, and returns that. So after
  results.setdefault(key, [])
results[key] will have a list in it, and we will have a reference to the list.

Then appending the list adds the query to the list in the dict.
Please let us know what solution you end up using, and how much memory it 
needs. I'm interested...
Kent
Scott Melnyk wrote:
Thanks for the thoughts so far.  After posting I have been thinking
about how to pare down the file (much of the info in the big file was
not relevant to this question at hand).
After the first couple of responses I was even more motivated to
shrink the file so not have to set up a db. This test will be run only
now and later to verify with another test set so the db set up seemed
liked more work than might be worth it.
I was able to reduce my file down about 160 mb in size by paring out
every line not directly related to what I want by some simple regular
expressions and a couple tests for inclusion.
The format and what info is compared against what is different from my
original examples as I believe this is more clear.
my queries are named by the lines such as:
ENSE1387275.1|ENSG0187908.1|ENST0339871.1
ENSE is an exon   ENSG is the gene ENST is a transcript
They all have the above format, they differ in in numbers above
following ENS[E,G orT].
Each query is for a different exon.  For background each gene has many
exons and there are different versions of which exons are in each gene
in this dataset.  These different collections are the transcripts ie
ENST0339871.1
in short a transcript is a version of a gene here
transcript 1 may be formed of  exons a,b and c 
transcript 2 may contain exons a,b,d 


the other lines (results) are of the format
hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...44  0.001
hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...44  0.001
hg17_chainMm5_chr7_random range=chr10:124355404-124355687 is the
important part here from 5'pad on is not important at this point
What I am trying to do is now make a list of any of the results that
appear in more than one transcript
##
FILE SAMPLE:
This is the number 1  query tested.
Results for scoring against Query=
ENSE1387275.1|ENSG0187908.1|ENST0339871.1
 are: 

hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...44  0.001
hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...44  0.001
hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr7_random range=chr10:124355388-124355719 5'pad=...44  0.001

This is the number 3  query tested.
Results for scoring against Query=
ENSE1365999.1|ENSG0187908.1|ENST0339871.1
 are: 

hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...60  2e-08
hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...60  2e-08
hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...60  2e-08
hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...60  2e-08
##
I would like to generate a file that looks for any results (the
hg17_etc  line) that occur in more than transcript (from the query
line ENSE1365999.1|ENSG0187908.1|ENST0339871.1)
so if  
hg17_chainMm5_chr7_random range=chr10:124355404-124355687 
 shows up again later in the file I want to know and want to record
where it is used more than once, otherwise I will ignore it.

I am think another reg expression to capture the transcript id
followed by  something that captures each of the results, and writes
to another file anytime a result appears more than once, and ties the
transcript ids to them somehow.
Any suggestions?
I agree if I had more time 

RE: [Tutor] sorting a 2 gb file- i shrunk it and turned it around

2005-01-25 Thread Scott Melnyk
Thanks for the thoughts so far.  After posting I have been thinking
about how to pare down the file (much of the info in the big file was
not relevant to this question at hand).

After the first couple of responses I was even more motivated to
shrink the file so not have to set up a db. This test will be run only
now and later to verify with another test set so the db set up seemed
liked more work than might be worth it.

I was able to reduce my file down about 160 mb in size by paring out
every line not directly related to what I want by some simple regular
expressions and a couple tests for inclusion.

The format and what info is compared against what is different from my
original examples as I believe this is more clear.


my queries are named by the lines such as:
ENSE1387275.1|ENSG0187908.1|ENST0339871.1
ENSE is an exon   ENSG is the gene ENST is a transcript

They all have the above format, they differ in in numbers above
following ENS[E,G orT].

Each query is for a different exon.  For background each gene has many
exons and there are different versions of which exons are in each gene
in this dataset.  These different collections are the transcripts ie
ENST0339871.1

in short a transcript is a version of a gene here
transcript 1 may be formed of  exons a,b and c 
transcript 2 may contain exons a,b,d 



the other lines (results) are of the format
hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...44  0.001
hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...44  0.001

hg17_chainMm5_chr7_random range=chr10:124355404-124355687 is the
important part here from 5'pad on is not important at this point


What I am trying to do is now make a list of any of the results that
appear in more than one transcript

##
FILE SAMPLE:

This is the number 1  query tested.
Results for scoring against Query=
ENSE1387275.1|ENSG0187908.1|ENST0339871.1
 are: 

hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...44  0.001
hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...44  0.001
hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr7_random range=chr10:124355388-124355719 5'pad=...44  0.001



This is the number 3  query tested.
Results for scoring against Query=
ENSE1365999.1|ENSG0187908.1|ENST0339871.1
 are: 

hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...60  2e-08
hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...60  2e-08
hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...60  2e-08
hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...60  2e-08

##

I would like to generate a file that looks for any results (the
hg17_etc  line) that occur in more than transcript (from the query
line ENSE1365999.1|ENSG0187908.1|ENST0339871.1)


so if  
hg17_chainMm5_chr7_random range=chr10:124355404-124355687 
 shows up again later in the file I want to know and want to record
where it is used more than once, otherwise I will ignore it.

I am think another reg expression to capture the transcript id
followed by  something that captures each of the results, and writes
to another file anytime a result appears more than once, and ties the
transcript ids to them somehow.

Any suggestions?
I agree if I had more time and was going to be doing more of this the
DB is the way to go.
-As an aside I have not looked into sqlite, I am hoping to avoid the
db right now, I'd have to get the sys admin to give me permission to
install something again etc etc. Where as I am hoping to get this
together in a reasonably short script.

 However I will look at it later (it could be helpful for other things for me.

Thanks again to all,  
Scott
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor