[
https://issues.apache.org/jira/browse/PDFBOX-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr closed PDFBOX-1805.
-----------------------------------
Resolution: Cannot Reproduce
I'm unable to reproduce this; see the attached file. Did you use any
non-standard parameters? I'm closing this; please comment and/or reopen.
> PDFTextStripper, add word segment even if the last word is a space
> ------------------------------------------------------------------
>
> Key: PDFBOX-1805
> URL: https://issues.apache.org/jira/browse/PDFBOX-1805
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.3
> Reporter: Andy Phillips
> Priority: Major
> Attachments: 36048C3D-54B9-4862-91AA-C94B33C11027.pdf, PDFBOX-1805.txt
>
>
> I found that, in some PDFs, the "line" normalization does not inject a word
> separator when the gap within a line is wider than expected for a space. As a
> result, text "fields" that should be separated (as they are not really part
> of the paragraph) are improperly appended to the line of text.
> In the attached PDF, looking at the first line of the first violation of
> code, the "Corrected By" date is incorrectly added to the same line as
> "Description of Violation". This happens because the first line of
> "Description of Violation" ends with a space, which comes from word wrapping
> of the paragraph when the PDF was generated. I believe that if the next
> character starts further right than an expected space would allow, it should
> be treated as a second segment, regardless of whether the preceding text
> ends in a space.
> I suggest removing the following check in the PDFTextStripper file (I
> commented out the last two conditions of the if statement):
> {code}
> // Test if our TextPosition starts after a new word would be expected to start.
> if (expectedStartOfNextWordX != EXPECTEDSTARTOFNEXTWORDX_RESET_VALUE
>         && expectedStartOfNextWordX < positionX) /* &&
>         // only bother adding a space if the last character was not a space
>         lastPosition.getTextPosition().getCharacter() != null &&
>         !lastPosition.getTextPosition().getCharacter().endsWith( " " ) ) */
> {
>     line.add(WordSeparator.getSeparator());
> }
> {code}
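
For anyone trying to reproduce this, the difference between the two variants of the check can be sketched in isolation. This is only a simplified model, not the PDFTextStripper code: the Glyph class, the joinLine method, the "|" separator, the spaceTolerance parameter, and the coordinates and date below are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class WordGapSketch {
    // Hypothetical stand-in for a text position: its text and horizontal extent.
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Sentinel meaning "no expected start computed yet" (assumed value).
    static final float RESET = -Float.MAX_VALUE;
    // Visible separator so the effect is easy to see in the output.
    static final String SEPARATOR = "|";

    // Joins glyphs into one line. When skipAfterTrailingSpace is true, no
    // separator is added if the previous glyph already ends with a space
    // (the two conditions the reporter commented out).
    static String joinLine(List<Glyph> glyphs, float spaceTolerance,
                           boolean skipAfterTrailingSpace) {
        StringBuilder line = new StringBuilder();
        float expectedStartOfNextWordX = RESET;
        Glyph last = null;
        for (Glyph g : glyphs) {
            boolean gapTooWide = expectedStartOfNextWordX != RESET
                    && expectedStartOfNextWordX < g.x;
            boolean lastEndsWithSpace = last != null && last.text.endsWith(" ");
            if (gapTooWide && !(skipAfterTrailingSpace && lastEndsWithSpace)) {
                line.append(SEPARATOR);
            }
            line.append(g.text);
            // Simplified assumption: next word is expected right after this
            // glyph plus a fixed tolerance.
            expectedStartOfNextWordX = g.x + g.width + spaceTolerance;
            last = g;
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("Violation ", 10f, 50f));   // ends at x = 60
        glyphs.add(new Glyph("01/02/2013", 200f, 50f));  // far to the right
        System.out.println(joinLine(glyphs, 3f, true));   // with the extra conditions
        System.out.println(joinLine(glyphs, 3f, false));  // as proposed in the issue
    }
}
```

With the extra conditions in place, the trailing space suppresses the separator and the two fields run together; without them, the wide gap alone is enough to start a new segment.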