Tika breaks words of rotated text in PDF documents
--------------------------------------------------

                 Key: TIKA-796
                 URL: https://issues.apache.org/jira/browse/TIKA-796
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.0, 0.10
         Environment: Windows 7 Professional x64, Java(TM) SE Runtime 
Environment (build 1.6.0_25-b06), Java HotSpot(TM) 64-Bit Server VM (build 
20.0-b11, mixed mode)
            Reporter: Franz Canaval


When Tika extracts text from a PDF file, *words in rotated text are broken 
apart.* The number of lines in a rotated paragraph appears to equal the number 
of characters after which Tika splits each word with a line feed (0x0a) 
character.

Steps to reproduce this issue (in this example, on a Windows machine):
* Download the following PDF file and save it as C:\temp\energieberatung.pdf: 
[http://www.verbraucherzentrale-rlp.de/mediabig/115471A.pdf]
* Open a console window and run tika with: {{java -jar tika-app.jar -t 
"file:///c:/temp/energieberatung.pdf" > test.txt}}
* Inspect the text file, e.g. with a hex editor, and note the words broken 
into 2-character pieces: {{<char1><char2><LF>}}
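
The broken-word pattern from the last step can also be checked without a hex 
editor. The following is a minimal sketch and not part of Tika: the class name 
BrokenWordCheck, the method countShortLines, and the 2-character threshold are 
assumptions chosen for illustration. It counts the non-empty lines of the 
extracted text that are at most a given number of characters long. If most 
lines of the file fall into that group, the output probably shows the symptom 
described in this report.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for inspecting Tika's text output; not a Tika API.
class BrokenWordCheck {

    /** Counts non-empty lines that are at most maxLen characters long. */
    static int countShortLines(String text, int maxLen) {
        int count = 0;
        // Split on LF (0x0a), the character reported between the fragments.
        for (String line : text.split("\n", -1)) {
            // Drop a trailing CR so CRLF files are measured the same way.
            if (line.endsWith("\r")) {
                line = line.substring(0, line.length() - 1);
            }
            if (!line.isEmpty() && line.length() <= maxLen) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BrokenWordCheck <extracted-text-file>");
            return;
        }
        // Assumption: the file was written in the platform default encoding,
        // as happens when console output is redirected with '>'.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        String text = new String(bytes, Charset.defaultCharset());

        int total = text.split("\n", -1).length;
        int shortLines = countShortLines(text, 2);
        System.out.println(shortLines + " of " + total
                + " lines have 1 or 2 characters");
    }
}
```

After compiling, run it against the file from the steps above, e.g. 
{{java BrokenWordCheck test.txt}}.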

*This problem seems to have been introduced in Tika 0.10; it still exists in 
Tika 1.0.*


        
