Uziel Sulkies created PDFBOX-4553:
-------------------------------------

             Summary: Break of backward compatibility from 2.0.14 to 2.0.15
                 Key: PDFBOX-4553
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4553
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 2.0.15
            Reporter: Uziel Sulkies
         Attachments: KYPolicy2.pdf

We use PDFTextStripper to parse some PDF documents. The parsing sometimes 
relies on the file's template and on the order of the words in it.

The following Kotlin code prints the text content of the attached file, sorted 
by position.
{code:java}
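import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import java.io.File

fun main() {
  val pdfTextStripper = PDFTextStripper()
  // Emit text in page-position order rather than content-stream order
  pdfTextStripper.sortByPosition = true
  // use {} closes the document once the text has been extracted
  PDDocument.load(File("/path/to/file/KYPolicy2.pdf").readBytes()).use { document ->
    print(pdfTextStripper.getText(document))
  }
}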
{code}
Running this code with PDFBox 2.0.14 and 2.0.15 gives different output for 
the following line:
{quote}POLICY PERIOD:  FROM 02/18/2018 TO 02/18/2019 (2.0.14)

POLICY PERIOD:  FROM 02/18/2018 02/18/2019TO (2.0.15)
{quote}
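The difference can be detected without reading the output by eye. The following is only a minimal sketch: the helper name `policyPeriodInOrder` and the regular expression are illustrative assumptions, not part of PDFBox, and it checks only the word order of the line quoted above.

```kotlin
// Hypothetical helper (not part of PDFBox): true when the POLICY PERIOD line
// has its words in the order "FROM <date> TO <date>", as seen with 2.0.14.
val policyPeriod =
  Regex("""POLICY PERIOD:\s+FROM\s+\d{2}/\d{2}/\d{4}\s+TO\s+\d{2}/\d{2}/\d{4}""")

fun policyPeriodInOrder(text: String): Boolean = policyPeriod.containsMatchIn(text)

fun main() {
  // The two lines quoted above, as produced by each version
  println(policyPeriodInOrder("POLICY PERIOD:  FROM 02/18/2018 TO 02/18/2019")) // true
  println(policyPeriodInOrder("POLICY PERIOD:  FROM 02/18/2018 02/18/2019TO")) // false
}
```

Passing the text returned by PDFTextStripper to this function should distinguish the two versions for the attached file.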
I suspect the cause is the change made in this commit:

[https://github.com/apache/pdfbox/commit/068146a9c9fe59becbd82814b6a245f8158fce22]

 

This prevents us from safely upgrading to the newer version.

[^KYPolicy2.pdf]



