[jira] [Created] (PDFBOX-2584) Text extraction reports zero character widths

Pavel Misurkin (JIRA) Thu, 25 Dec 2014 03:06:58 -0800

Pavel Misurkin created PDFBOX-2584:
--------------------------------------

             Summary: Text extraction reports zero character widths 
                 Key: PDFBOX-2584
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2584
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.8, 2.0.0
            Reporter: Pavel Misurkin



We are using the PDFBox API to get the positions of characters within a document.
We have found a problem with one document: text extraction returns the correct
text but sets every character's width to zero.

The code is pretty simple:
{code}
            File input = new File("stip_2c.pdf");
            PDDocument document = PDDocument.load(input);

            PDFTextStripper extractor = new PDFTextStripper();
            Writer output = new StringWriter();

            // Extracting the text also collects the character positions.
            extractor.writeText(document, output);
{code}

We then examine the value of the extractor's charactersByArticle member for the
character widths.

- We first found the issue in 1.8.4:
all character widths were zero.

- In version 1.8.8
all character widths were zero except for whitespace.
See the new check added in 1.8.8, in file
pdfbox-1.8.8-src\pdfbox\src\main\java\org\apache\pdfbox\util\PDFStreamEngine.java,
line 369:
{code}
        if (spaceWidthText == 0)
        {
            spaceWidthText = 1.0f; // if could not find font, use a generic value
        }
{code}

- In version 2.0.0 the problem still exists.
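
To tell the symptoms above apart (every width zero versus every width zero
except whitespace), the collected characters can be summarized with a small
helper. This is only a minimal sketch and does not call PDFBox: WidthReport and
Glyph are hypothetical names, with Glyph standing in for the text and width read
from each character position, and the values in main are made up for
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WidthReport {

    // Hypothetical stand-in for the text and width read from one character position.
    static final class Glyph {
        final String text;
        final float width;

        Glyph(String text, float width) {
            this.text = text;
            this.width = width;
        }
    }

    // Returns the non-whitespace glyphs whose reported width is exactly zero.
    static List<Glyph> zeroWidthNonWhitespace(List<Glyph> glyphs) {
        List<Glyph> result = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean whitespace = g.text.trim().isEmpty();
            if (!whitespace && g.width == 0f) {
                result.add(g);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up values shaped like the 1.8.8 symptom: only the space has a width.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("H", 0f),
                new Glyph("i", 0f),
                new Glyph(" ", 1.0f),
                new Glyph("!", 0f));

        List<Glyph> suspicious = zeroWidthNonWhitespace(glyphs);
        System.out.println(suspicious.size() + " of " + glyphs.size()
                + " glyphs have zero width");
    }
}
```

If the returned list covers every non-whitespace character, the document shows
the behaviour described in this report.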




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
