Vertical text extraction splitting text
---------------------------------------
Key: PDFBOX-358
URL: https://issues.apache.org/jira/browse/PDFBOX-358
Project: PDFBox
Issue Type: Improvement
Components: Text extraction
Reporter: Jukka Zitting
[Issue from SourceForge]
http://sourceforge.net/tracker/index.php?func=detail&aid=1981851&group_id=78314&atid=552832
Vertical text gets split during extraction using PDFTextStripper.
"Specification" gives:
Spécif
ic
ations
This is made worse when sorted by position, as it gets mixed up with the
horizontal text:
ic
ations
[CLASSIFIED INFO]
[CLASSIFIED INFO]
Spécif [CLASSIFIED INFO]
[CLASSIFIED INFO]
I'm afraid I can't provide the PDF in question due to confidentiality
requirements. It was produced by converting a Microsoft Word document to
PDF. According to the forums, I'm not the only one with this problem.
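To illustrate why sorting by position interleaves the two kinds of text,
here is a minimal self-contained sketch. It is not PDFBox code: the Chunk
class, the direction field, and all coordinates are made up for the
example, and the accent is dropped from the fragments to keep the source
ASCII. It compares a plain top-to-bottom sort with one that first
separates fragments by text direction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class DirectionAwareSort {

    /** Hypothetical stand-in for a positioned text fragment; not a PDFBox class. */
    static final class Chunk {
        final String text;
        final double x;       // left edge, in page units
        final double y;       // distance from the top of the page
        final int direction;  // 0 = horizontal, 90 = rotated 90 degrees counter-clockwise

        Chunk(String text, double x, double y, int direction) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.direction = direction;
        }
    }

    /** Naive position sort: top to bottom, then left to right, ignoring direction. */
    static List<String> sortByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.<Chunk>comparingDouble(c -> c.y).thenComparingDouble(c -> c.x));
        return texts(sorted);
    }

    /** Direction-aware sort: horizontal text first, then rotated text in its own reading order. */
    static List<String> sortByDirection(List<Chunk> chunks) {
        List<Chunk> horizontal = new ArrayList<>();
        List<Chunk> rotated = new ArrayList<>();
        for (Chunk c : chunks) {
            (c.direction == 0 ? horizontal : rotated).add(c);
        }
        horizontal.sort(Comparator.<Chunk>comparingDouble(c -> c.y).thenComparingDouble(c -> c.x));
        // Text rotated 90 degrees counter-clockwise reads bottom to top,
        // so within one column the fragment with the largest y comes first.
        rotated.sort(Comparator.<Chunk>comparingDouble(c -> c.x)
                .thenComparing(Comparator.<Chunk>comparingDouble(c -> c.y).reversed()));
        List<Chunk> all = new ArrayList<>(horizontal);
        all.addAll(rotated);
        return texts(all);
    }

    private static List<String> texts(List<Chunk> chunks) {
        List<String> out = new ArrayList<>();
        for (Chunk c : chunks) {
            out.add(c.text);
        }
        return out;
    }

    /** Invented sample: four horizontal lines plus one rotated word split into three fragments. */
    static List<Chunk> sample() {
        return Arrays.asList(
                new Chunk("Line A", 50, 100, 0),
                new Chunk("Line B", 50, 120, 0),
                new Chunk("Line C", 50, 140, 0),
                new Chunk("Line D", 50, 160, 0),
                new Chunk("Specif", 20, 150, 90),
                new Chunk("ic", 20, 125, 90),
                new Chunk("ations", 20, 110, 90));
    }

    public static void main(String[] args) {
        // [Line A, ations, Line B, ic, Line C, Specif, Line D]
        System.out.println(sortByPosition(sample()));
        // [Line A, Line B, Line C, Line D, Specif, ic, ations]
        System.out.println(sortByDirection(sample()));
    }
}
```

With the naive sort the rotated fragments land between the horizontal
lines and come out in reverse order. Grouping by direction first keeps
them together, though joining the fragments back into one word would still
need separate handling.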
[Comment on SourceForge]
Date: 2008-06-02 09:11
Sender: totoll
Logged In: YES
user_id=2096423
Originator: YES
To clarify, the text in question is rotated by 90° counter-clockwise.
[Comment on SourceForge]
Date: 2008-06-02 10:30
Sender: totoll
Logged In: YES
user_id=2096423
Originator: YES
I have attached an admittedly very complicated PDF document which (as far
as I can tell) features 90° and 135° rotated text in a 90° rotated page.
Position-ordered text extraction gives horrible results.
Normal text extraction is also very messy, although in this second case
the results are almost understandable.
This is not the document I need to process, but I think that if text can
be correctly extracted from that PDF, extraction should work for almost
every other existing PDF.
File Added: Flyer2.pdf
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=279847&aid=1981851