Hi all, I have two files:
- PSP0000320.dat (quite a large list of mobile numbers),
- CBR0000319.dat (a subset of the above, a list of barred numbers)

    # head PSP0000320.dat CBR0000319.dat
    ==> PSP0000320.dat <==
    96653696338
    96653766996
    96654609431
    96654722608
    96654738074
    96655697044
    96655824738
    96656190117
    96656256762
    96656263751

    ==> CBR0000319.dat <==
    96651131135
    96651131135
    96651420412
    96651730095
    96652399117
    96652399142
    96652399142
    96652399142
    96652399160
    96652399271

Objective: to remove the numbers present in the barred list from the PSP file.

    $ ls -lh PSP0000320.dat CBR0000319.dat
    ... 56M Dec 28 19:41 PSP0000320.dat
    ... 8.6M Dec 28 19:40 CBR0000319.dat

    $ wc -l PSP0000320.dat CBR0000319.dat
    4,462,603 PSP0000320.dat
    693,585 CBR0000319.dat

I wrote the following in Python to do it:

    #: c01:rmcommon.py
    barredlist = open(r'/home/sjd/python/wip/CBR0000319.dat', 'r')
    postlist = open(r'/home/sjd/python/wip/PSP0000320.dat', 'r')
    outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')

    # read both files in one go, so as to avoid frequent disk accesses
    # (assumes the machine has plenty of memory)
    barred = barredlist.readlines()
    post = postlist.readlines()

    for number in post:
        # 'in' on a list scans the whole barred list for every number
        if number in barred:
            pass
        else:
            outfile.write(number)

    barredlist.close(); postlist.close(); outfile.close()
    #:~

The above code simply takes too long to complete. If I instead run diff -y PSP0000320.dat CBR0000319.dat, catch the '<' lines and clean them up with sed -e 's/\([0-9]*\) *</\1/' > PSP-CBR.dat, it takes under 4 minutes to complete.

I wrote the following in bash to do the same:

    #!/bin/bash

    ARGS=2

    if [ $# -ne $ARGS ]    # takes two arguments
    then
        echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}"
        echo; echo "  eg.: `basename $0` PSP0000320.dat CBR0000319.dat"; echo;
        echo "NOTE: first argument: PSP file, second: CBR file";
        echo "      this script does _no_ input validation!"
        exit 1
    fi;

    # fix prefix; cost: 12.587 secs
    cat $1 | sed -e 's/^0*/966/' > $1.good
    cat $2 | sed -e 's/^0*/966/' > $2.good

    # sort/save files; for the 4,462,603 lines, cost: 36.589 secs
    sort $1.good > $1.sorted
    sort $2.good > $2.sorted

    # diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs
    diff -y $1.sorted $2.sorted | grep "<" > $1.filtered

    # remove trailing junk [spaces & <]; cost: 1 min 3 secs
    cat $1.filtered | sed -e 's/\([0-9]*\) *</\1/' > $1.cleaned

    # remove intermediate files, good, sorted, filtered
    rm -f *.good *.sorted *.filtered
    #:~

...but strangely, there's a discrepancy, the reason for which I can't figure out!

Needless to say, I'm utterly new to Python and my programming skills & know-how are rudimentary.

Any help will be genuinely appreciated.

-- 
fynali
-- 
http://mail.python.org/mailman/listinfo/python-list
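
For comparison, here is a minimal sketch of the same filtering done with a Python set instead of a list. It is one possible approach, not a tested fix for these exact files. The function names (load_barred, keep_unbarred, filter_file) are made up for illustration, and the sketch assumes one number per line and that the barred list fits comfortably in memory. A set lookup takes roughly constant time on average, so each PSP number no longer triggers a scan of all the barred entries, and the duplicate entries in the barred file collapse automatically.

```python
def load_barred(lines):
    """Build a set of barred numbers from an iterable of text lines."""
    # strip() drops the trailing newline and any stray whitespace
    return {line.strip() for line in lines}


def keep_unbarred(lines, barred):
    """Yield only the lines whose number is not in the barred set."""
    for line in lines:
        if line.strip() not in barred:
            yield line


def filter_file(post_path, barred_path, out_path):
    """Write every line of post_path that is not barred to out_path."""
    with open(barred_path) as barred_file:
        barred = load_barred(barred_file)
    # Stream the large file line by line instead of reading it whole
    with open(post_path) as post_file, open(out_path, "w") as out_file:
        out_file.writelines(keep_unbarred(post_file, barred))
```

It could be called as filter_file('PSP0000320.dat', 'CBR0000319.dat', 'PSP-CBR.dat'). As for the discrepancy, one thing that may be worth checking (a guess, not a verified diagnosis): in side-by-side mode, diff marks lines found only in the first file with '<' but marks lines it pairs up as changed with '|', so grep "<" could drop PSP numbers that diff happened to pair against a different CBR number.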