Package: python-pypdf
Version: 1.12-2
Severity: normal

While using pdfshuffler on PDF statements from my stock broker, I would consistently get an exception from pypdf on export. Note that pdfshuffler's own display, along with evince, acroread, kpdf, etc., has no problem with these documents.
On inspection it turns out that pypdf's parsing is rather primitive and doesn't handle extra spaces, linefeeds in place of spaces, etc. Here is an example of PDF source causing problems:

  9 0 obj
  <<
  /Type /Font
  /Subtype /Type1
  /Encoding 4 0 R
  /BaseFont /Times-Bold
  >> endobj

I will attach a patch that makes parsing more lax about whitespace in a few places that were significant to my document. However, this is just the tip of the iceberg. Unfortunately the pypdf code is written in a rather low-level fashion, and addressing the problem fully will be a large task.

-- System Information:
Debian Release: squeeze/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: i386 (i686)

Kernel: Linux 2.6.30-2-686 (SMP w/1 CPU core)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages python-pypdf depends on:
ii  python-support                1.0.6      automated rebuilding support for P

python-pypdf recommends no packages.

python-pypdf suggests no packages.

-- no debconf information

-- debsums errors found:
debsums: changed file /usr/share/python-support/python-pypdf/pyPdf/pdf.py (from python-pypdf package)
debsums: changed file /usr/share/python-support/python-pypdf/pyPdf/generic.py (from python-pypdf package)
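
As a small illustration of the whitespace sensitivity described in this report, the sketch below compares the indirect-reference pattern used in generic.py with the relaxed pattern from the patch. The two regular expressions are copied from the patch; the sample inputs are made up for the demonstration and are not taken from the broker's PDF. It is written for Python 3, whereas pypdf 1.12 itself targets Python 2.

```python
import re

# Pattern from generic.py in pypdf 1.12: exactly one whitespace character
# is allowed between the object number and the generation number.
STRICT = re.compile(r"(\d+)\s(\d+)\sR[^a-zA-Z]")

# Pattern from the patch: a run of whitespace is allowed in that position.
PATCHED = re.compile(r"(\d+)\s+(\d+)\sR[^a-zA-Z]")

# Made-up inputs (an assumption for illustration only).
samples = ["4 0 R\n", "4  0 R\n", "4 \n0 R\n"]

# Map each sample to (strict matches?, patched matches?).
results = {s: (bool(STRICT.match(s)), bool(PATCHED.match(s))) for s in samples}

for sample, (strict_ok, patched_ok) in results.items():
    print(repr(sample), "strict:", strict_ok, "patched:", patched_ok)
```

With the strict pattern, a reference separated by more than one whitespace character is not recognized as an indirect reference at all, so the parser falls through to reading a plain number.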
# patch to pypdf to tolerate whitespace in cases like this
# (generated by Exstream Dialogue 6.1.015):
#
# 9 0 obj
# <<
# /Type /Font
# /Subtype /Type1
# /Encoding 4 0 R
# /BaseFont /Times-Bold
# >> endobj

diff -urb orig/generic.py new/generic.py
--- orig/generic.py 2009-12-29 23:09:18.556359182 -0500
+++ new/generic.py 2009-12-29 23:07:10.780361180 -0500
@@ -35,7 +35,7 @@
 __author_email__ = "[email protected]"
 
 import re
-from utils import readNonWhitespace, RC4_encrypt
+from utils import readNonWhitespace, readUntilWhitespace, RC4_encrypt
 import filters
 import utils
 import decimal
@@ -81,7 +81,7 @@
             return NumberObject.readFromStream(stream)
         peek = stream.read(20)
         stream.seek(-len(peek), 1) # reset to start
-        if re.match(r"(\d+)\s(\d+)\sR[^a-zA-Z]", peek) != None:
+        if re.match(r"(\d+)\s+(\d+)\sR[^a-zA-Z]", peek) != None:
             return IndirectObject.readFromStream(stream, pdf)
         else:
             return NumberObject.readFromStream(stream)
@@ -183,19 +183,10 @@
         stream.write("%s %s R" % (self.idnum, self.generation))
 
     def readFromStream(stream, pdf):
-        idnum = ""
-        while True:
-            tok = stream.read(1)
-            if tok.isspace():
-                break
-            idnum += tok
-        generation = ""
-        while True:
-            tok = stream.read(1)
-            if tok.isspace():
-                break
-            generation += tok
-        r = stream.read(1)
+        idnum = readUntilWhitespace(stream)
+        readNonWhitespace(stream); stream.seek(-1, 1)
+        generation = readUntilWhitespace(stream)
+        r = readNonWhitespace(stream)
         if r != "R":
             raise utils.PdfReadError("error reading indirect object reference")
         return IndirectObject(int(idnum), int(generation), pdf)
diff -urb orig/pdf.py new/pdf.py
--- orig/pdf.py 2009-12-29 23:09:17.632359905 -0500
+++ new/pdf.py 2009-12-29 23:11:00.444359823 -0500
@@ -586,10 +586,13 @@
         # tables that are off by whitespace bytes.
         readNonWhitespace(stream); stream.seek(-1, 1)
         idnum = readUntilWhitespace(stream)
+        readNonWhitespace(stream); stream.seek(-1, 1)
         generation = readUntilWhitespace(stream)
-        obj = stream.read(3)
-        readNonWhitespace(stream)
-        stream.seek(-1, 1)
+        readNonWhitespace(stream); stream.seek(-1, 1)
+        obj_token = stream.read(3)
+        if obj_token != 'obj':
+            raise utils.PdfReadError("Error reading object header")
+        readNonWhitespace(stream); stream.seek(-1, 1)
         return int(idnum), int(generation)
 
     def cacheIndirectObject(self, generation, idnum, obj):
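
The object-header change in pdf.py follows a simple pattern: skip any run of whitespace, read a token, and repeat. Below is a minimal standalone sketch of that pattern in Python 3, not pypdf's actual code. The helpers skip_whitespace, read_token, and read_object_header are hypothetical stand-ins written for this example (loosely mirroring what readNonWhitespace and readUntilWhitespace are used for in the patch), ValueError stands in for utils.PdfReadError, and the sample bytes are made up.

```python
import io

# The six whitespace bytes defined by the PDF format: NUL, HT, LF, FF, CR, SP.
WHITESPACE = b"\x00\t\n\x0c\r "


def skip_whitespace(stream):
    """Advance past any run of whitespace, stopping at the next token."""
    while True:
        byte = stream.read(1)
        if not byte:
            return  # end of stream
        if byte not in WHITESPACE:
            stream.seek(-1, io.SEEK_CUR)  # un-read the token's first byte
            return


def read_token(stream):
    """Read bytes up to, but not including, the next whitespace byte or EOF."""
    token = b""
    while True:
        byte = stream.read(1)
        if not byte:
            return token
        if byte in WHITESPACE:
            stream.seek(-1, io.SEEK_CUR)  # leave the whitespace unread
            return token
        token += byte


def read_object_header(stream):
    """Parse '<id> <gen> obj', tolerating whitespace runs between tokens."""
    skip_whitespace(stream)
    idnum = read_token(stream)
    skip_whitespace(stream)
    generation = read_token(stream)
    skip_whitespace(stream)
    if stream.read(3) != b"obj":
        raise ValueError("error reading object header")
    skip_whitespace(stream)
    return int(idnum), int(generation)


# Made-up header with a double space and a linefeed between tokens.
stream = io.BytesIO(b"9  0\nobj\n<<\n/Type /Font\n>> endobj")
header = read_object_header(stream)
rest = stream.read(2)  # the stream is now positioned at the dictionary
print(header, rest)
```

Separating "skip whitespace" from "read token" means every position between tokens is handled the same way, which is the general direction a fuller fix would probably need to take.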

