On Tuesday, February 4, 2014 2:27:38 AM UTC+5:30, Dave Angel wrote:
> Ayushi Dalmia <ayushidalmia2...@gmail.com> Wrote in message:
> > On Thursday, January 30, 2014 4:20:26 PM UTC+5:30, Ayushi Dalmia wrote:
> >> Hello,
> >>
> >> I need to randomly access a bzip2 or gzip file. How can I set the offset
> >> for a line and later retrieve the line from the file using the offset.
> >> Pointers in this direction will help.
> >
> > This is what I have done:
> >
> > import bz2
> > import sys
> > from random import randint
> >
> > index={}
> >
> > data=[]
> > f=open('temp.txt','r')
> > for line in f:
> >     data.append(line)
> >
> > filename='temp1.txt.bz2'
> > with bz2.BZ2File(filename, 'wb', compresslevel=9) as f:
> >     f.writelines(data)
> >
> > prevsize=0
> > list1=[]
> > offset={}
> > with bz2.BZ2File(filename, 'rb') as f:
> >     for line in f:
> >         words=line.strip().split(' ')
> >         list1.append(words[0])
> >         offset[words[0]]= prevsize
> >         prevsize = sys.getsizeof(line)+prevsize
>
> sys.getsizeof looks at internal size of a python object, and is
> totally unrelated to a size on disk of a text line. len () might
> come closer, unless you're on Windows. You really should be using
> tell to define the offsets for later seek. In text mode any other
> calculation is not legal, ie undefined.
>
> > data=[]
> > count=0
> >
> > with bz2.BZ2File(filename, 'rb') as f:
> >     while count<20:
> >         y=randint(1,25)
> >         print y
> >         print offset[str(y)]
> >         count+=1
> >         f.seek(int(offset[str(y)]))
> >         x= f.readline()
> >         data.append(x)
> >
> > f=open('b.txt','w')
> > f.write(''.join(data))
> > f.close()
> >
> > where temp.txt is the posting list file which is first written in a
> > compressed format and then read later.
>
> I thought you were starting with a compressed file. If you're
> being given an uncompressed file, just deal with it directly.
>
> > I am trying to build the index for the entire wikipedia dump which needs to
> > be done in a space and time optimised way. The temp.txt is as follows:
> >
> > 1 456 t0b3c0i0e0:784 t0b2c0i0e0:801 t0b2c0i0e0
> > 2 221 t0b1c0i0e0:774 t0b1c0i0e0:801 t0b2c0i0e0
> > 3 455 t0b7c0i0e0:456 t0b1c0i0e0:459 t0b2c0i0e0:669 t0b10c11i3e0:673 t0b1c0i0e0:678 t0b2c0i1e0:854 t0b1c0i0e0
> > 4 410 t0b4c0i0e0:553 t0b1c0i0e0:609 t0b1c0i0e0
> > 5 90 t0b1c0i0e0
>
> So every line begins with its line number in ascii form? If true,
> the dict above called offsets should just be a list.
>
> Maybe you should just quote the entire assignment. You're
> probably adding way too much complication to it.
>
> --
> DaveA
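
For reference, here is a rough sketch of the tell()/seek() approach suggested above, written for Python 3. The file name postings.txt.bz2 and the helpers build_line_index() and read_line_at() are made up for illustration; only bz2.BZ2File and its tell(), seek() and readline() methods come from the standard library. The offsets are positions in the uncompressed stream and go into a plain list, so line N is found at offsets[N-1].

```python
import bz2
import os
import tempfile


def build_line_index(path):
    """Return a list with the uncompressed byte offset of every line."""
    offsets = []
    with bz2.BZ2File(path, "rb") as f:
        while True:
            pos = f.tell()        # position in the uncompressed stream
            line = f.readline()
            if not line:          # empty bytes object means end of file
                break
            offsets.append(pos)
    return offsets


def read_line_at(path, offset):
    """Seek to a recorded offset and return the line that starts there."""
    with bz2.BZ2File(path, "rb") as f:
        f.seek(offset)
        return f.readline()


# Small demo using three of the sample posting lines from this thread.
lines = [
    b"1 456 t0b3c0i0e0:784 t0b2c0i0e0:801 t0b2c0i0e0\n",
    b"2 221 t0b1c0i0e0:774 t0b1c0i0e0:801 t0b2c0i0e0\n",
    b"5 90 t0b1c0i0e0\n",
]
path = os.path.join(tempfile.mkdtemp(), "postings.txt.bz2")  # hypothetical name
with bz2.BZ2File(path, "wb", compresslevel=9) as f:
    f.writelines(lines)

offsets = build_line_index(path)
third = read_line_at(path, offsets[2])
```

One caveat: seeking in a BZ2File is emulated by decompressing, so a seek far into a large file can be very slow. For something the size of a full dump, splitting the data into several smaller compressed files and indexing within each is probably worth considering.
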

Hey! I am new here. Sorry about the incorrect posts; I didn't understand the protocol then. Although I have the uncompressed text, I cannot start right away with it.
--
https://mail.python.org/mailman/listinfo/python-list