On Thursday, January 30, 2014 4:20:26 PM UTC+5:30, Ayushi Dalmia wrote:
> Hello,
>
> I need to randomly access a bzip2 or gzip file. How can I set the offset for
> a line and later retrieve the line from the file using the offset? Pointers
> in this direction will help.
This is what I have done:

import bz2
from random import randint

# Read the uncompressed posting list.
with open('temp.txt', 'rb') as f:
    data = f.readlines()

# Write it out bzip2-compressed.
filename = 'temp1.txt.bz2'
with bz2.BZ2File(filename, 'wb', compresslevel=9) as f:
    f.writelines(data)

# For each term id (first field), record where its line starts in the
# uncompressed stream.
offset = {}
prevsize = 0
with bz2.BZ2File(filename, 'rb') as f:
    for line in f:
        words = line.strip().split()
        offset[int(words[0])] = prevsize
        # len(line) is the number of bytes the uncompressed stream advances;
        # sys.getsizeof(line) would also count Python object overhead.
        prevsize += len(line)

# Fetch 20 random lines by seeking to their recorded offsets.
data = []
with bz2.BZ2File(filename, 'rb') as f:
    for count in range(20):
        y = randint(1, 25)
        print(y)
        print(offset[y])
        f.seek(offset[y])
        data.append(f.readline())

with open('b.txt', 'wb') as f:
    f.write(b''.join(data))

where temp.txt is the posting list file, which is first written in a compressed format and then read back later. I am trying to build the index for the entire Wikipedia dump, which needs to be done in a space- and time-optimised way. The temp.txt is as follows:

1 456 t0b3c0i0e0:784 t0b2c0i0e0:801 t0b2c0i0e0
2 221 t0b1c0i0e0:774 t0b1c0i0e0:801 t0b2c0i0e0
3 455 t0b7c0i0e0:456 t0b1c0i0e0:459 t0b2c0i0e0:669 t0b10c11i3e0:673 t0b1c0i0e0:678 t0b2c0i1e0:854 t0b1c0i0e0
4 410 t0b4c0i0e0:553 t0b1c0i0e0:609 t0b1c0i0e0
5 90 t0b1c0i0e0
6 727 t0b2c0i0e0
7 431 t0b2c0i1e0
8 532 t0b1c0i0e0:652 t0b1c0i0e0:727 t0b2c0i0e0
9 378 t0b1c0i0e0
10 666 t0b2c0i0e0
11 405 t0b1c0i0e0
12 702 t0b1c0i0e0
13 755 t0b1c0i0e0
14 781 t0b1c0i0e0
15 593 t0b1c0i0e0
16 725 t0b1c0i0e0
17 989 t0b2c0i1e0
18 221 t0b1c0i0e0:402 t0b1c0i0e0:842 t0b1c0i0e0
19 405 t0b1c0i0e0
20 200 t0b1c0i0e0:300 t0b1c0i0e0:398 t0b1c0i0e0:649 t0b1c0i0e0
21 66 t0b1c0i0e0
22 30 t0b1c0i0e0
23 126 t0b1c0i0e0:895 t0b1c0i0e0
24 355 t0b1c0i0e0:374 t0b1c0i0e0:378 t0b1c0i0e0:431 t0b3c0i0e0:482 t0b1c0i0e0:546 t0b3c0i0e0:578 t0b1c0i0e0
25 198 t0b1c0i0e0
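The offset bookkeeping can also be sketched with tell(), which reports the current position in the uncompressed stream, so no line lengths have to be summed by hand. This is only a minimal sketch, written for Python 3: the helper names build_offset_index and read_line_at, the file name demo.txt.bz2, and the three-line sample are made up for illustration and are not part of the script above. Note that seek() on a BZ2File is emulated by decompressing, so on a file the size of a Wikipedia dump each lookup may be slow.

```python
import bz2
import os
import tempfile


def build_offset_index(path):
    """Map the first field of each line to its offset in the uncompressed stream."""
    index = {}
    with bz2.BZ2File(path, 'rb') as f:
        while True:
            pos = f.tell()        # uncompressed position where this line starts
            line = f.readline()
            if not line:          # empty bytes object means end of file
                break
            fields = line.split()
            if fields:            # skip blank lines
                index[int(fields[0])] = pos
    return index


def read_line_at(path, pos):
    """Return the line that starts at uncompressed offset pos."""
    with bz2.BZ2File(path, 'rb') as f:
        f.seek(pos)
        return f.readline()


# Hypothetical three-line posting list in the same layout as temp.txt.
sample = [
    b"1 456 t0b3c0i0e0:784 t0b2c0i0e0\n",
    b"2 221 t0b1c0i0e0\n",
    b"3 455 t0b7c0i0e0:456 t0b1c0i0e0\n",
]
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt.bz2')
with bz2.BZ2File(demo_path, 'wb', compresslevel=9) as f:
    f.writelines(sample)

demo_index = build_offset_index(demo_path)
print(demo_index)
print(read_line_at(demo_path, demo_index[2]))
```

The index here lives in a dict in memory; for a full dump it would presumably have to be written to disk as well, which this sketch does not attempt.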