[
https://issues.apache.org/jira/browse/PDFBOX-878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Rodríguez Alfayate updated PDFBOX-878:
--------------------------------------------
Attachment: pdfbox-word-rotation.patch
Patch for the current codebase, which fixes the text-rotation problem described in this bug.
It changes the assumption that (0,0) is the upper-left corner, adds a Word class to merge
several TextPosition objects consistently, and adds a sample XML extraction
> Incorrect text extraction when text rotation is not 0,90,180,270
> ----------------------------------------------------------------
>
> Key: PDFBOX-878
> URL: https://issues.apache.org/jira/browse/PDFBOX-878
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.3.1, 1.4.0
> Reporter: David Rodríguez Alfayate
> Attachments: pdfbox-word-rotation.patch, rotation.pdf,
> rotation_failure.txt
>
>
> Currently text extraction only supports rotations of 0, 90, 180 or 270 degrees,
> so text rotated by any other angle is extracted incorrectly. I have attached a
> simple PDF and the text extraction result as output by ExtractText.
> I have made a patch for the current revision (1.4.0) that takes into account any
> rotation in the current matrix position. I had to refactor the treatment of
> (0,0) as the upper-left corner, because for rotations that place text outside
> the original assumption, a word or a line could end up split.
> Since we need this for a project we are working on, I have also changed
> the way normalization and line printing are done. In the current codebase
> the normalize function returns a List of Strings; with my changes this
> method returns a List of Words, which are ICU-normalized and then printed
> in the current writeLine method. My patch also includes a sample
> PDF2XML class, which converts the PDF to XML, handling each word
> separately.
> I submit the test cases and the patch for your consideration.
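
For readers who want to see what "any rotation in the current matrix" amounts to, here is a minimal sketch. It is not taken from the attached patch. It assumes the usual PDF convention that a pure rotation by theta has matrix entries a = cos(theta) and b = sin(theta), so the angle can be recovered with atan2(b, a). The class name RotationAngle and both method names are hypothetical and are not part of PDFBox.

```java
// Hypothetical helper, not PDFBox API: recovers a rotation angle from the
// first two entries (a, b) of a PDF-style matrix [a b c d e f].
final class RotationAngle {

    // Returns the rotation in degrees, normalized to the range [0, 360).
    // Assumes a = cos(theta) * scale and b = sin(theta) * scale.
    static double angleDegrees(double a, double b) {
        double degrees = Math.toDegrees(Math.atan2(b, a));
        return degrees < 0 ? degrees + 360.0 : degrees;
    }

    // True when the angle (expected in [0, 360)) lies within `tolerance`
    // degrees of a multiple of 90, i.e. one of the four axis-aligned
    // rotations the issue says are currently supported.
    static boolean isAxisAligned(double degrees, double tolerance) {
        double remainder = degrees % 90.0;
        double distance = Math.min(remainder, 90.0 - remainder);
        return distance <= tolerance;
    }

    public static void main(String[] args) {
        // Matrix entries for text rotated by 30 degrees.
        double theta = Math.toRadians(30);
        double angle = angleDegrees(Math.cos(theta), Math.sin(theta));
        System.out.println("angle = " + angle
                + ", axis-aligned = " + isAxisAligned(angle, 1e-6));
    }
}
```

Text for which isAxisAligned returns false is the case the issue describes: it does not fit any of the four supported orientations, so the extractor has to work with the actual angle instead.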
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.