Re: [Tutor] variation of Unique items question

2005-02-04 Thread Kent Johnson
You need to reset your items_dict when you see an hg17 line.
Here is one way to do it. I used a class to make it easier to break the problem into functions. 
Putting the functions in a class makes it easy to share the header and counts.

class Grouper:
    ''' Process a sequence of strings of the form
            Header
            Data
            Data
            Header
            ...
        Look for repeated Data items under a single Header. When found, print
        the Header and the repeated item.

        Possible usage:
            out = open('outfile.txt', 'w')
            Grouper().process(open('infile.txt'), 'hg17', out)
            out.close()
    '''

    def reset(self, header='No header'):
        ''' Reset the current header and counts '''
        self.currHeader = header
        self.counts = {}

    def process(self, data, headerStart, out):
        ''' Find duplicates within groups of lines of data '''
        self.reset()
        for line in data:
            line = line.strip()  # get rid of newlines from file input
            if line.startswith(headerStart):
                # Found a new header line; show the current group and restart
                self.showDups(out)
                self.reset(line)
            elif line:
                # Found a data line; count it
                self.counts[line] = self.counts.get(line, 0) + 1

        # Show the last group
        self.showDups(out)

    def showDups(self, out):
        # Get the list of items with count > 1
        items = [(k, cnt) for k, cnt in self.counts.items() if cnt > 1]

        # Show the items
        if items:
            items.sort()
            print >> out, self.currHeader
            for k, cnt in items:
                print >> out, '%s occurs %d times' % (k, cnt)
            print >> out


if __name__ == '__main__':
    import sys
    data = '''hg17_chainMm5_chr15 range=chr7:148238502-148239073
ENST0339563.1
ENST0342196.1
ENST0339563.1
ENST0344055.1
hg17_chainMm5_chr13 range=chr5:42927967-42928726
ENST0279800.3
ENST0309556.3
ENST0279800.3
hg17_chainMm5_chr6 range=chr1:155548627-155549517
ENST0321157.3
ENST0256324.4'''.split('\n')

    Grouper().process(data, 'hg17', sys.stdout)
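
The same reset-on-header idea can also be written without a class. The sketch below is only illustrative (the names find_dups and report_dups are made up for this example, not taken from the code above), but it follows the same steps: count data lines into a dict, and dump-and-reset the dict each time a new header line appears.

def report_dups(header, counts, out):
    # Report any item in this group that was seen more than once
    dups = sorted((k, c) for k, c in counts.items() if c > 1)
    if dups:
        print >> out, header
        for k, c in dups:
            print >> out, '%s occurs %d times' % (k, c)
        print >> out

def find_dups(lines, header_start, out):
    header, counts = 'No header', {}
    for line in lines:
        line = line.strip()
        if line.startswith(header_start):
            report_dups(header, counts, out)   # finish the previous group
            header, counts = line, {}          # start a new group
        elif line:
            counts[line] = counts.get(line, 0) + 1
    report_dups(header, counts, out)           # don't forget the last group

Called as find_dups(open('infile.txt'), 'hg17', sys.stdout), it should print the same kind of report as Grouper does above.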
Kent
Scott Melnyk wrote:
Hello once more.
I am stuck on how best to tie the "finding unique items in lists" ideas to my file.
I am stuck at the level below: what I have here, taken from the unique items thread, does not work, because I need to separate each grouping by the hg chain it is in (see below for examples).
import sys

WFILE = open(sys.argv[1], 'w')

def get_list_dup_dict(fname='Z:/datasets/fooyoo.txt', threshold=2):
    a_list = open(fname, 'r')
    #print "beginning get_list_dup"
    items_dict, dup_dict = {}, {}

    for i in a_list:
        items_dict[i] = items_dict.get(i, 0) + 1

    for k, v in items_dict.iteritems():
        if v == threshold:
            dup_dict[k] = v

    return dup_dict

def print_list_dup_report(fname='Z:/datasets/fooyoo.txt', threshold=2):
    #print "Beginning report generation"
    dup_dict = get_list_dup_dict(fname, threshold)
    for k, v in sorted(dup_dict.iteritems()):
        print >> WFILE, '%s occurred %s times' % (k, v)

if __name__ == '__main__':
    print_list_dup_report()
My issue is that my file is as follows:
hg17_chainMm5_chr15 range=chr7:148238502-148239073
ENST0339563.1
ENST0342196.1
ENST0339563.1
ENST0344055.1
hg17_chainMm5_chr13 range=chr5:42927967-42928726
ENST0279800.3
ENST0309556.3
hg17_chainMm5_chr6 range=chr1:155548627-155549517
ENST0321157.3
ENST0256324.4
  
I need a printout that gives the hg17 line followed by any ENST instances that occur more than once, but only within that chain section. Even better, it would print the hg17 line only when it is followed by an ENST instance that occurs more than once.

I am hoping for something that gives me an output file roughly like:
hg17_chainMm5_chr15 range=chr7:148238502-148239073
ENST0339563.1 occurs 2 times
hg17_chainMm5_chr13 range=chr5:42927967-42928726
ENST0279800.3 occurs 2 times
 

All help and ideas are appreciated. I am trying to get this finished as soon as possible; the output file will be used to go back to my 2 GB file and pull out the rest of the data I need.
Thanks,
Scott
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor

