On Fri, 07 Jan 2011 22:43:54 -0600, Keith Anthony wrote:

> My previous question asked how to read a file into a structure a line at
> a time.  Figured it out.  Now I'm trying to use .find to separate out
> the PDF objects.  (See code)  PROBLEM/QUESTION: My call to lines[i].find
> does NOT find all instances of endobj.  Any help available?  Any
> insights?
>
> #!/usr/bin/python
>
> inputfile = file('sample.pdf','rb')   # This is PDF with which we will work
> lines = inputfile.readlines()         # read file one line at a time

That's incorrect. readlines() reads the entire file in one go, and splits
it into individual lines.

> linestart = []    # Starting address for each line
> lineend = []      # Ending address for each line
> linetype = []

*raises eyebrow*

How is an empty list a starting or ending address? The only thing worse
than no comments where you need them is misleading comments. A variable
called "linestart" implies that it should be a position, e.g.
linestart = 0. Or possibly a flag.

> print len(lines)    # print number of lines
>
> i = 0    # define an iterator, i

Again, 0 is not an iterator. 0 is a number.

> addr = 0    # and address pointer
>
> while i < len(lines):    # Go through each line
>     linestart = linestart + [addr]
>     length = len(lines[i])
>     lineend = lineend + [addr + (length-1)]
>     addr = addr + length
>     i = i + 1

Complicated and confusing and not the way to do it in Python. Something
like this is much simpler:

    linetypes = []  # note plural
    inputfile = open('sample.pdf','rb')  # Don't use file, use open.
    for line_number, line in enumerate(inputfile):
        # Process one line at a time. No need for that nonsense with manually
        # tracked line numbers, enumerate() does that for us.
        # No need to initialise linetypes.
        status = 'normal'
        i = line.find(' obj')
        if i >= 0:
            print "Object found at offset %d in line %d" % (i, line_number)
            status = 'object'
        i = line.find('endobj')
        if i >= 0:
            print "endobj found at offset %d in line %d" % (i, line_number)
            if status == 'normal':
                status = 'endobj'
            else:
                status = 'object & endobj'  # both found on the one line
        linetypes.append(status)

    # What if obj or endobj exist more than once in a line?

One last thing... if PDF files are a binary format, what makes you think
that they can be processed line-by-line? They may not have lines, except
by accident.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
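
The two open questions in the reply (more than one obj or endobj on a line, and a binary file that may not have meaningful lines) both go away if the search runs over the raw bytes rather than over lines. What follows is only a minimal sketch under some assumptions: it is written for Python 3 rather than the Python 2 used in the thread, the helper name find_all and the sample bytes are hypothetical, and a plain byte search can also match text inside stream data, so it is not a real PDF parser.

```python
def find_all(data, token):
    """Return the byte offsets of every occurrence of token in data."""
    offsets = []
    pos = data.find(token)
    while pos != -1:
        offsets.append(pos)
        # Resume just past this match so repeated tokens are all found,
        # even when several occur on what would have been one "line".
        pos = data.find(token, pos + len(token))
    return offsets


# Hypothetical stand-in for the contents of a PDF. For a real file,
# read the bytes in one go, e.g.:
#     with open('sample.pdf', 'rb') as f:
#         data = f.read()
sample = b"1 0 obj\n<< >>\nendobj 2 0 obj\n<< >>\nendobj\n"

obj_offsets = find_all(sample, b" obj")
endobj_offsets = find_all(sample, b"endobj")
print("obj at", obj_offsets)
print("endobj at", endobj_offsets)
```

Because the offsets are positions in the whole file, there is no need to track per-line start and end addresses at all; pairing each " obj" offset with the next "endobj" offset gives the extent of each object.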