Bug#563443: python-pypdf: parsing not robust to whitespace

John V. Belmonte Sat, 02 Jan 2010 16:11:31 -0800

Package: python-pypdf
Version: 1.12-2
Severity: normal

While using pdfshuffler on PDF statements from my stock broker, on export I'd
consistently get an exception from pypdf.  Note that pdfshuffler's own display,
along with evince, acroread, kpdf, etc. have no problem with these documents.


On inspection it turns out that pypdf's parsing is rather primitive
and doesn't handle the presence of extra spaces, linefeeds in place of
space, etc.  Here is an example of PDF source causing problems:

   9     0 obj
   <<
   /Type /Font
   /Subtype /Type1
   /Encoding 4          0 R
   /BaseFont /Times-Bold
   >> endobj
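
To illustrate the failure on the /Encoding line above, here is a minimal sketch. It is not pypdf code. It applies the pattern generic.py uses to detect an indirect reference, alongside a variant that allows a run of whitespace after the object number. The names STRICT, LAX and peek are made up for the example, and peek only imitates the 20-byte lookahead the parser takes.

```python
import re

# Pattern as it appears in pypdf 1.12's generic.py: exactly one
# whitespace character is allowed after the object number.
STRICT = re.compile(r"(\d+)\s(\d+)\sR[^a-zA-Z]")

# Variant that tolerates a run of whitespace after the object number.
LAX = re.compile(r"(\d+)\s+(\d+)\sR[^a-zA-Z]")

# Imitation of the 20-byte lookahead for "/Encoding 4          0 R".
peek = ("4" + " " * 10 + "0 R\n/BaseFont")[:20]

strict_hit = STRICT.match(peek)
lax_hit = LAX.match(peek)

print("strict:", strict_hit)        # no match, so "4" is read as a plain number
print("lax:   ", lax_hit.groups())  # object number and generation
```

With the strict pattern the value is taken for a bare number rather than a reference, which is where the export goes wrong.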

I will attach a patch that makes parsing more lax about whitespace in a few
places that were significant to my document.  However, this is just the tip
of the iceberg.  Unfortunately, the pypdf code is written in a rather low-level
fashion, and addressing the problem fully will be a large task.
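
As a rough idea of what whitespace-tolerant reading looks like in general, here is a self-contained sketch in modern Python. It is not the attached patch and not pypdf's API: skip_whitespace, read_token and read_indirect_reference are hypothetical stand-ins written for this example, and the whitespace set is the six whitespace bytes defined by the PDF format.

```python
import io

# The six bytes PDF treats as whitespace: NUL, TAB, LF, FF, CR, SPACE.
WHITESPACE = b"\x00\t\n\x0c\r "


def skip_whitespace(stream):
    """Advance past any run of whitespace; stop just before the next token."""
    while True:
        b = stream.read(1)
        if not b:
            return  # end of stream
        if b not in WHITESPACE:
            stream.seek(-1, 1)  # un-read the first non-whitespace byte
            return


def read_token(stream):
    """Read bytes up to, but not including, the next whitespace byte."""
    token = b""
    while True:
        b = stream.read(1)
        if not b:
            return token  # end of stream
        if b in WHITESPACE:
            stream.seek(-1, 1)  # leave the whitespace for the caller
            return token
        token += b


def read_indirect_reference(stream):
    """Parse '<id> <generation> R', allowing any whitespace between parts."""
    skip_whitespace(stream)
    idnum = read_token(stream)
    skip_whitespace(stream)
    generation = read_token(stream)
    skip_whitespace(stream)
    if stream.read(1) != b"R":
        raise ValueError("error reading indirect object reference")
    return int(idnum), int(generation)


ref = read_indirect_reference(io.BytesIO(b"4          0 R"))
print(ref)
```

The point of the design is that every token boundary goes through the same skip step, so extra spaces or linefeeds are handled uniformly instead of case by case.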

-- System Information:
Debian Release: squeeze/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: i386 (i686)

Kernel: Linux 2.6.30-2-686 (SMP w/1 CPU core)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages python-pypdf depends on:
ii  python-support                1.0.6      automated rebuilding support for P

python-pypdf recommends no packages.

python-pypdf suggests no packages.

-- no debconf information

-- debsums errors found:
debsums: changed file /usr/share/python-support/python-pypdf/pyPdf/pdf.py (from python-pypdf package)
debsums: changed file /usr/share/python-support/python-pypdf/pyPdf/generic.py (from python-pypdf package)

# patch to pypdf to tolerate whitespace in cases like this
# (generated by Exstream Dialogue 6.1.015):
#
#   9     0 obj
#   <<
#   /Type /Font
#   /Subtype /Type1
#   /Encoding 4          0 R
#   /BaseFont /Times-Bold
#   >> endobj

diff -urb orig/generic.py new/generic.py
--- orig/generic.py	2009-12-29 23:09:18.556359182 -0500
+++ new/generic.py	2009-12-29 23:07:10.780361180 -0500
@@ -35,7 +35,7 @@
 __author_email__ = "biz...@mathieu.fenniak.net"

 import re
-from utils import readNonWhitespace, RC4_encrypt
+from utils import readNonWhitespace, readUntilWhitespace, RC4_encrypt
 import filters
 import utils
 import decimal
@@ -81,7 +81,7 @@
             return NumberObject.readFromStream(stream)
         peek = stream.read(20)
         stream.seek(-len(peek), 1) # reset to start
-        if re.match(r"(\d+)\s(\d+)\sR[^a-zA-Z]", peek) != None:
+        if re.match(r"(\d+)\s+(\d+)\sR[^a-zA-Z]", peek) != None:
             return IndirectObject.readFromStream(stream, pdf)
         else:
             return NumberObject.readFromStream(stream)
@@ -183,19 +183,10 @@
         stream.write("%s %s R" % (self.idnum, self.generation))

     def readFromStream(stream, pdf):
-        idnum = ""
-        while True:
-            tok = stream.read(1)
-            if tok.isspace():
-                break
-            idnum += tok
-        generation = ""
-        while True:
-            tok = stream.read(1)
-            if tok.isspace():
-                break
-            generation += tok
-        r = stream.read(1)
+        idnum = readUntilWhitespace(stream)
+        readNonWhitespace(stream); stream.seek(-1, 1)
+        generation = readUntilWhitespace(stream)
+        r = readNonWhitespace(stream)
         if r != "R":
             raise utils.PdfReadError("error reading indirect object reference")
         return IndirectObject(int(idnum), int(generation), pdf)
diff -urb orig/pdf.py new/pdf.py
--- orig/pdf.py	2009-12-29 23:09:17.632359905 -0500
+++ new/pdf.py	2009-12-29 23:11:00.444359823 -0500
@@ -586,10 +586,13 @@
         # tables that are off by whitespace bytes.
         readNonWhitespace(stream); stream.seek(-1, 1)
         idnum = readUntilWhitespace(stream)
+        readNonWhitespace(stream); stream.seek(-1, 1)
         generation = readUntilWhitespace(stream)
-        obj = stream.read(3)
-        readNonWhitespace(stream)
-        stream.seek(-1, 1)
+        readNonWhitespace(stream); stream.seek(-1, 1)
+        obj_token = stream.read(3)
+        if obj_token != 'obj':
+            raise utils.PdfReadError("Error reading object header")
+        readNonWhitespace(stream); stream.seek(-1, 1)
         return int(idnum), int(generation)

     def cacheIndirectObject(self, generation, idnum, obj):
