[ 
https://issues.apache.org/jira/browse/TIKA-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160841#comment-13160841
 ] 

Michael McCandless commented on TIKA-796:
-----------------------------------------

This looks like a duplicate of TIKA-723?

Note that with Tika 1.1 (not yet released) you can call 
PDFParser.setSortByPosition(true) and the rotated text should be extracted 
correctly (I just confirmed on this PDF).

However, that will also cause multi-column text (e.g. two columns) to become 
"interleaved", which is usually not what you want if the text is going to be 
fed into a search index.
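
To make the interleaving concrete, here is a minimal sketch in plain Java. It 
is not PDFBox's actual algorithm: the Chunk class, the coordinates, and the 
simple "top-to-bottom, then left-to-right" ordering are all assumptions chosen 
only to illustrate the effect on a two-column page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ColumnSortDemo {
    /** Hypothetical text chunk: its text plus its position on the page (y grows downward). */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** A two-column page in assumed content-stream order: left column first, then right. */
    static List<Chunk> samplePage() {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("L1", 50, 100));
        chunks.add(new Chunk("L2", 50, 120));
        chunks.add(new Chunk("R1", 300, 100));
        chunks.add(new Chunk("R2", 300, 120));
        return chunks;
    }

    /** Simplified stand-in for position sorting: by y first, then by x. */
    static List<Chunk> sortByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return sorted;
    }

    /** Joins the chunk texts with single spaces, in list order. */
    static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> page = samplePage();
        System.out.println("stream order:       " + join(page));                 // L1 L2 R1 R2
        System.out.println("sorted by position: " + join(sortByPosition(page))); // L1 R1 L2 R2
    }
}
```

In stream order each column stays contiguous, while the position-sorted order 
alternates between the left and right columns line by line, which is the 
interleaving described above.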

I would love to fix PDFBox somehow to dynamically pick the right setting for 
the right chunk of text; often the rotated text arrives in the PDF as a single 
chunk of text and we could in theory extract it correctly even when 
setSortByPosition is false...
                
> Tika breaks words of rotated text in PDF documents
> --------------------------------------------------
>
>                 Key: TIKA-796
>                 URL: https://issues.apache.org/jira/browse/TIKA-796
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10, 1.0
>         Environment: Windows 7 Professional x64, Java(TM) SE Runtime 
> Environment (build 1.6.0_25-b06), Java HotSpot(TM) 64-Bit Server VM (build 
> 20.0-b11, mixed mode)
>            Reporter: Franz Canaval
>              Labels: broken, linefeed, pdf, rotated, text, words
>
> When Tika extracts text from a PDF file, *words in rotated text come out 
> broken.* The number of lines in a rotated paragraph appears to equal the 
> number of characters after which Tika splits each word with a line feed 
> (0x0a) character.
> Steps to reproduce this issue (in this example, on a Windows machine):
> * Download the following pdf file: 
> [http://www.verbraucherzentrale-rlp.de/mediabig/115471A.pdf], e.g. to C:\temp\
> * Open a console window and run tika with: {{java -jar tika-app.jar -t 
> "file:///c:/temp/energieberatung.pdf" > test.txt}}
> * Look at the text file, e.g. with a hex editor, and note the words broken 
> into 2-character pieces: {{<char1><char2><LF>}}
> *This problem seems to have been introduced with Tika 0.10 and still exists 
> in Tika 1.0.*

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
