[ https://issues.apache.org/jira/browse/TIKA-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Franz Canaval resolved TIKA-796.
--------------------------------

    Resolution: Duplicate

Duplicate of https://issues.apache.org/jira/browse/TIKA-723

> Tika breaks words of rotated text in PDF documents
> --------------------------------------------------
>
>                 Key: TIKA-796
>                 URL: https://issues.apache.org/jira/browse/TIKA-796
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10, 1.0
>         Environment: Windows 7 Professional x64, Java(TM) SE Runtime Environment (build 1.6.0_25-b06), Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)
>            Reporter: Franz Canaval
>              Labels: broken, linefeed, pdf, rotated, text, words
>
> When Tika extracts text from a PDF file, *rotated text is extracted in such a way that words are broken.* The number of lines in a rotated paragraph appears to equal the number of characters after which Tika breaks the words apart with a line feed (0x0a) character.
> Steps to reproduce this issue (in this example, on a Windows machine):
> * Download the following PDF file: [http://www.verbraucherzentrale-rlp.de/mediabig/115471A.pdf], e.g. to C:\temp\
> * Open a console window and run Tika with: {{java -jar tika-app.jar -t "file:///c:/temp/energieberatung.pdf" > test.txt}}
> * Have a look at the text file, e.g. with a hex editor, and note the words broken into 2-character pieces: {{<char1><char2><LF>}}
> *This problem seems to have been introduced with Tika 0.10; it still exists in Tika 1.0.*
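
As an alternative to inspecting test.txt in a hex editor, the {{<char1><char2><LF>}} pattern from the last reproduction step could be detected with a small check like the one below. This is only a sketch and not part of Tika: the class name BrokenWordCheck, the method longestShortLineRun, the 2-character threshold, and the UTF-8 decoding of the output file are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: flags extracted text in which
// words appear to have been split into very short lines.
public class BrokenWordCheck {

    // Returns the length of the longest run of consecutive lines that each
    // contain between 1 and maxLen characters after trimming whitespace.
    static int longestShortLineRun(String text, int maxLen) {
        int longest = 0;
        int current = 0;
        for (String line : text.split("\n", -1)) {
            int len = line.trim().length(); // trim() also drops a trailing CR
            if (len > 0 && len <= maxLen) {
                current++;
                longest = Math.max(longest, current);
            } else {
                current = 0; // a blank or longer line ends the run
            }
        }
        return longest;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample imitating the reported pattern; pass a file path
        // (e.g. test.txt) as the first argument to check real output.
        String text = "Normal line of text\nEn\ner\ngi\neb\ner\nat\nun\ng\nAnother normal line\n";
        if (args.length > 0) {
            // Assumption: the redirected output is UTF-8; adjust if it is not.
            text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        }
        System.out.println("Longest run of lines with <= 2 characters: "
                + longestShortLineRun(text, 2));
    }
}
```

A long run of such short lines suggests the broken-word behaviour described in the report, while a short run may just be list markers or page numbers, so the result is a hint rather than proof.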