[Tutor] variation of Unique items question

2005-02-04 Thread Scott Melnyk
Hello once more.

I am stuck on how best to tie the finding Unique Items in Lists ideas to my file

I am stuck at level below:  What I have here taken from the unique
items thread does not work as I need to separate each grouping to the
hg chain it is in (see below for examples)

import sys
WFILE=open(sys.argv[1], 'w') 
def get_list_dup_dict(fname='Z:/datasets/fooyoo.txt', threshold=2):
a_list=open(fname, 'r')
   #print beginning get_list_dup
items_dict, dup_dict = {}, {}

for i in a_list:
items_dict[i] = items_dict.get(i, 0) + 1

for k, v in items_dict.iteritems():
if v==threshold:
dup_dict[k] = v

return dup_dict

def print_list_dup_report(fname='Z:/datasets/fooyoo.txt', threshold=2):
#print Beginning report generation
dup_dict = get_list_dup_dict(fname='Z:/datasets/fooyoo.txt', threshold=2)
for k, v in sorted(dup_dict.iteritems()):
print WFILE,'%s occurred %s times' %(k, v)

if __name__ == '__main__':
print_list_dup_report()


My issue is that my file is as follows:
hg17_chainMm5_chr15 range=chr7:148238502-148239073
ENST0339563.1
ENST0342196.1
ENST0339563.1
ENST0344055.1

hg17_chainMm5_chr13 range=chr5:42927967-42928726
ENST0279800.3
ENST0309556.3

hg17_chainMm5_chr6 range=chr1:155548627-155549517
ENST0321157.3
ENST0256324.4
  
I need a print out that would give the line hg17 and then any
instances of the ENST that occur more than once only for that chain
section.  Even better it only prints the hg17 line if it is followed
by an instance of ENST that occurs more than once

I am hoping for something that gives me an out file roughly like:

hg17_chainMm5_chr15 range=chr7:148238502-148239073
ENST0339563.1 occurs 2 times

hg17_chainMm5_chr13 range=chr5:42927967-42928726
ENST0279800.3 occurs 2 times
 

All help and ideas appreciated, I am trying to get this finished as
soon as possible, the output file will be used to go back to my 2 gb
file and pull out the rest of the data I need.

Thanks,
Scott
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


[Tutor] Re: variation of Unique items question

2005-02-04 Thread Scott Melnyk
Hello.

Kent once again you have responded incredibly quickly in a most
helpful manor.  I sometimes wonder if the old reference to a
Kent-bot has some truth to it.

Thanks again, I will play with it and keep on going.

Scott
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


[Tutor] sorting a 2 gb file

2005-01-25 Thread Scott Melnyk
Hello!

I am wondering about the best way to handle sorting some data from
some of my results.

I have an file in the form shown at the end  (please forgive any
wrapparounds due to the width of the  screen here- the lines starting
with ENS end with the e-12 or what have you on same line.)

What I would like is to generate an output file of  any other
ENSE000...e-4 (or whathaveyou) lines that appear in more than one
place and for each of those the queries they appear related to.

So if the first line
ENSE1098330.2|ENSG0013573.6|ENST0350437.2 assembly=N...
etc appears as a result in any other query I would like it and the
queries it appears as a result to (including the score if possible).

My data set the below is taken from is over 2.4 gb so speed and memory
considerations come into play.  Are sets more effective than lists for
this?  To save space in the new file I really only need the name of
the result up to the | and the score at the end for each.
to simplify things, the score could be dropped, and I could check it
out as needed later.

As always all feedback is very appreciated. 

Thanks,
Scott

FILE:

This is the number 1  query tested.
Results for scoring against Query= hg17_chainMm5_chr17
range=chr1:2040-3330 5'pad=0 3'pad=0
 are: 

ENSE1098330.2|ENSG0013573.6|ENST0350437.2 assembly=N...72  1e-12
ENSE1160046.1|ENSG0013573.6|ENST0251758.3 assembly=N...72  1e-12
ENSE1404464.1|ENSG0013573.6|ENST0228264.4 assembly=N...72  1e-12
ENSE1160046.1|ENSG0013573.6|ENST0290818.3 assembly=N...72  1e-12
ENSE1343865.2|ENSG0013573.6|ENST0350437.2 assembly=N...46  8e-05
ENSE1160049.1|ENSG0013573.6|ENST0251758.3 assembly=N...46  8e-05
ENSE1343865.2|ENSG0013573.6|ENST0228264.4 assembly=N...46  8e-05
ENSE1160049.1|ENSG0013573.6|ENST0290818.3 assembly=N...46  8e-05

This is the number 2  query tested.
Results for scoring against Query= hg17_chainMm5_chr1
range=chr1:82719-95929 5'pad=0 3'pad=0
 are: 

ENSE1373792.1|ENSG0175182.4|ENST0310585.3 assembly=N...80  6e-14
ENSE1134144.2|ENSG0160013.2|ENST0307155.2 assembly=N...78  2e-13
ENSE1433065.1|ENSG0185480.2|ENST0358383.1 assembly=N...78  2e-13
ENSE1422761.1|ENSG0183160.2|ENST0360503.1 assembly=N...74  4e-12
ENSE1431410.1|ENSG0139631.6|ENST0308926.3 assembly=N...74  4e-12
ENSE1433065.1|ENSG0185480.2|ENST0358383.1 assembly=N...72  1e-11
ENSE1411753.1|ENSG0126882.4|ENST0358329.1 assembly=N...72  1e-11
ENSE1428167.1|ENSG0110497.4|ENST0314823.4 assembly=N...72  1e-11
ENSE1401130.1|ENSG0160828.5|ENST0359898.1 assembly=N...72  1e-11
ENSE1414900.1|ENSG0176920.4|ENST0356650.1 assembly=N...72  1e-11
ENSE1428167.1|ENSG0110497.4|ENST0314823.4 assembly=N...72  1e-11
ENSE1400942.1|ENSG0138670.5|ENST0356373.1 assembly=N...72  1e-11
ENSE1400116.1|ENSG0120907.6|ENST0356368.1 assembly=N...70  6e-11
ENSE1413546.1|ENSG0184209.6|ENST0344033.2 assembly=N...70  6e-11
ENSE1433572.1|ENSG0124243.5|ENST0355583.1 assembly=N...70  6e-11
ENSE1423154.1|ENSG0125875.4|ENST0354200.1 assembly=N...70  6e-11
ENSE1400109.1|ENSG0183785.3|ENST0339190.2 assembly=N...70  6e-11
ENSE1268950.4|ENSG0084112.4|ENST0303438.2 assembly=N...68  2e-10
ENSE1057279.1|ENSG0161270.6|ENST0292886.2 assembly=N...68  2e-10
ENSE1434317.1|ENSG0171453.2|ENST0304004.2 assembly=N...68  2e-10
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


RE: [Tutor] sorting a 2 gb file- i shrunk it and turned it around

2005-01-25 Thread Scott Melnyk
Thanks for the thoughts so far.  After posting I have been thinking
about how to pare down the file (much of the info in the big file was
not relevant to this question at hand).

After the first couple of responses I was even more motivated to
shrink the file so not have to set up a db. This test will be run only
now and later to verify with another test set so the db set up seemed
liked more work than might be worth it.

I was able to reduce my file down about 160 mb in size by paring out
every line not directly related to what I want by some simple regular
expressions and a couple tests for inclusion.

The format and what info is compared against what is different from my
original examples as I believe this is more clear.


my queries are named by the lines such as:
ENSE1387275.1|ENSG0187908.1|ENST0339871.1
ENSE is an exon   ENSG is the gene ENST is a transcript

They all have the above format, they differ in in numbers above
following ENS[E,G orT].

Each query is for a different exon.  For background each gene has many
exons and there are different versions of which exons are in each gene
in this dataset.  These different collections are the transcripts ie
ENST0339871.1

in short a transcript is a version of a gene here
transcript 1 may be formed of  exons a,b and c 
transcript 2 may contain exons a,b,d 



the other lines (results) are of the format
hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...44  0.001
hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...44  0.001

hg17_chainMm5_chr7_random range=chr10:124355404-124355687 is the
important part here from 5'pad on is not important at this point


What I am trying to do is now make a list of any of the results that
appear in more than one transcript

##
FILE SAMPLE:

This is the number 1  query tested.
Results for scoring against Query=
ENSE1387275.1|ENSG0187908.1|ENST0339871.1
 are: 

hg17_chainMm5_chr7_random range=chr10:124355404-124355687 5'pad=...44  0.001
hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...44  0.001
hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...44  0.001
hg17_chainMm5_chr7_random range=chr10:124355388-124355719 5'pad=...44  0.001



This is the number 3  query tested.
Results for scoring against Query=
ENSE1365999.1|ENSG0187908.1|ENST0339871.1
 are: 

hg17_chainMm5_chr14 range=chr10:124355392-124355530 5'pad=0 3'pa...60  2e-08
hg17_chainMm5_chr7 range=chr10:124355391-124355690 5'pad=0 3'pad...60  2e-08
hg17_chainMm5_chr6 range=chr10:124355389-124355690 5'pad=0 3'pad...60  2e-08
hg17_chainMm5_chr7 range=chr10:124355388-124355687 5'pad=0 3'pad...60  2e-08

##

I would like to generate a file that looks for any results (the
hg17_etc  line) that occur in more than transcript (from the query
line ENSE1365999.1|ENSG0187908.1|ENST0339871.1)


so if  
hg17_chainMm5_chr7_random range=chr10:124355404-124355687 
 shows up again later in the file I want to know and want to record
where it is used more than once, otherwise I will ignore it.

I am think another reg expression to capture the transcript id
followed by  something that captures each of the results, and writes
to another file anytime a result appears more than once, and ties the
transcript ids to them somehow.

Any suggestions?
I agree if I had more time and was going to be doing more of this the
DB is the way to go.
-As an aside I have not looked into sqlite, I am hoping to avoid the
db right now, I'd have to get the sys admin to give me permission to
install something again etc etc. Where as I am hoping to get this
together in a reasonably short script.

 However I will look at it later (it could be helpful for other things for me.

Thanks again to all,  
Scott
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


[Tutor] sorting and editing large data files

2004-12-16 Thread Scott Melnyk
Hello!

I recently suffered a loss of programming files (and I had been
putting off my backups...)

The project I am working on involved the use of programs which I had
been assisted with by some of the group.

I am rewritten the programs however my results are not what I expected.

I would appreciate anyone who could follow through the steps and
programs below and tell me if it seems correct, which it does to me. 
It could be the error is based on a flaw that I had made before.

I will give very simplified explanations of the data to help.

Starting a file with genetic data about 95 mb

A sample of the file format is:

ENSE1384652.1|ENSG0166157.5|ENST0359693.1
assembly=NCBI35|chr=21|strand=reverse|bases 10012801 to 10012624|exons
plus upstream and downstream regions for exon
GCGGCCGTTCAAGGCAGCCGTCTCCGAGCGGCCCAAGGGAGGGCACAACAGCTGCTACCTGAACAGTTTCTGACCCAACAGTTACCCAGCGCCGGACTCGCTGCGGGCGGCTCTAGGGACGGCGCCTACACTTAGCTCCGCGCCCGAGGTGAGCCCAG

It is hard to tell due to email formating however  from the  to the
words regions for exon is one line, all the GCATs are on one line.

ENSExxx is an exon id tag  followed by a ENSGxgene id tag then
a ENSTxxx transcript id tag followed by information about the
location of exon.

the letters GCTA are the actually exon sequence.

1 gene can have different versions, these are called transcripts and
each transcript contains a number of exons

In order to visually understand the data better I made a script to
organize it into sections with Gene ID, then Trancript ID followed by
the different  Exon IDs like so:



NEW GENE
ENSG0166157.5 ENST0359693.1 ENSE1384652.1
ENSE1365174.1 ENSE1372282.1 ENSE1369818.1
ENSE1371166.1 ENSE1369770.1 ENSE1122873.3
ENSE1156126.1 ENSE1156118.1 ENSE1378322.1
ENSE1156105.1 ENSE1156099.1 ENSE1156092.1
ENSE1156083.1 ENSE1156075.1 ENSE1156069.1
ENSE1156063.1 ENSE1156057.1 ENSE1100323.1
ENSE1156046.1 ENSE1126180.1 ENSE1100365.1
ENSE1156031.1 ENSE1231719.5

NEW TRANSCRIPT
ENSG0166157.5 ENST0298232.4 ENSE1384652.1
ENSE1365174.1 ENSE1372282.1 ENSE1369818.1
ENSE1371166.1 ENSE1369770.1 ENSE1156126.1
ENSE1156118.1 ENSE1378322.1 ENSE1156105.1
ENSE1156099.1 ENSE1156092.1 ENSE1156083.1
ENSE1156075.1 ENSE1156069.1 ENSE1156063.1
ENSE1156057.1 ENSE1100323.1 ENSE1156046.1
ENSE1126180.1 ENSE1100365.1 ENSE1156031.1
ENSE1231719.5

NEW TRANSCRIPT
ENSG0166157.5 ENST0342420.2 ENSE1384652.1
ENSE1365174.1 ENSE1372282.1 ENSE1369818.1
ENSE1371166.1 ENSE1369770.1 ENSE1156118.1
ENSE1378322.1 ENSE1156105.1 ENSE1156099.1
ENSE1156092.1 ENSE1156083.1 ENSE1156075.1
ENSE1156069.1 ENSE1156063.1 ENSE1156057.1
ENSE1100323.1 ENSE1156046.1 ENSE1126180.1
ENSE1100365.1 ENSE1156031.1 ENSE1231719.5

NEW TRANSCRIPT
ENSG0166157.5 ENST0339704.2 ENSE1384652.1
ENSE1364671.1 ENSE1387876.1 ENSE1384389.1
ENSE1156111.1 ENSE1156105.1 ENSE1156099.1
ENSE1156092.1 ENSE1156083.1 ENSE1156075.1
ENSE1156069.1 ENSE1156063.1 ENSE1156057.1
ENSE1100323.1 ENSE1156046.1 ENSE1126180.1
ENSE1100365.1 ENSE1156031.1 ENSE1231719.5
end of gene group

NEW GENE
ENSG0187172.4 ENST0335369.2 ENSE1428001.1
ENSE1331994.3 ENSE1398768.1 ENSE1429773.1
etc.

This was accomplished with the following program I called OrganizeData.py

#3
#execute with python datafile outputfile

import re,sys,string

#regular expression to pull out gene, transcript and exon ids

info=re.compile('^(ENSE\d+\.\d).+(ENSG\d+\.\d).+(ENST\d+\.\d)')
#   sample from the file 
#ENSE1373825.1|ENSG0187908.1|ENST0344467.1


file = open( sys.argv[1], 'r' ) #file to read from  
WriteFile=open(sys.argv[2], 'w')#file to write to

#initialize some variables  GENEID is the name of the gene, TRANSID is
name of the transcript etc
sOldGene=
sGENEID=
sOldTrans=
sTRANSID=
sEXONID=

for line in file:
if info.match(line):
Matched= info.match(line)

sEXONID=Matched.group(1)
sGENEID=Matched.group(2)
sTRANSID=Matched.group(3)

if sOldGene==sGENEID:   #if current line is the same gene as 
last line
if sOldTrans == sTRANSID:  #if current line is same 
transcript as last line
print  WriteFile, sEXONID,#add new exon 
to the current transcript line

else:
print  WriteFile,\n\n\tNEW TRANSCRIPT\n, 
sGENEID, sTRANSID,
sEXONID,   #new transcript of same gene print 2