[ https://issues.apache.org/jira/browse/TIKA-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157602#comment-13157602 ]
Michael McCandless commented on TIKA-723:
-----------------------------------------

The sortByPosition option is tricky to default "properly", since the right choice depends on whether you use the resulting text/xhtml to 1) feed a search engine (in which case, at least for 2-column PDFs, you don't want to sort by position), or 2) render something a user will look at directly (in which case I think you do want to sort by position, for better "fidelity" with what the document looks like in a PDF viewer).

The default has flipped back and forth recently... and is currently off, but with TIKA-612 you can now set it directly on your PDFParser instance.

> Rotated text isn't extracted correctly from PDFs
> ------------------------------------------------
>
>                 Key: TIKA-723
>                 URL: https://issues.apache.org/jira/browse/TIKA-723
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: rotated.pdf
>
>
> I have an example PDF with 90 degree rotation; Tika produces the
> characters one line at a time. I.e., the doc has "Some rotated text,
> here!" but Tika produces this:
> {noformat}
> <body><div class="page"><p>So
> m
> e
>
> r
> o
> t
> a
> t
> e
> d
>
> t
> e
> x
> t
> ,
>
> h
> e
> r
> e
> !</p>
> {noformat}
> I'm able to copy/paste the text out correctly.
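
To make the sortByPosition trade-off from the comment concrete, here is a toy sketch in plain Java. It does not use Tika or PDFBox. The Chunk class, the coordinates, and the method names are all hypothetical, and it assumes the PDF stores a two-column page column by column in its content stream, which is common but not guaranteed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: this is NOT Tika or PDFBox code.
class ReadingOrderDemo {

    /** A piece of text plus the (hypothetical) position where it is drawn. */
    static final class Chunk {
        final String text;
        final int x; // distance from the left page edge
        final int y; // distance from the top page edge

        Chunk(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * A two-column page, listed in the order we ASSUME the chunks appear in
     * the content stream: left column top to bottom, then right column.
     */
    static List<Chunk> twoColumnPage() {
        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk("Left column, line 1.", 50, 100));
        page.add(new Chunk("Left column, line 2.", 50, 120));
        page.add(new Chunk("Right column, line 1.", 300, 100));
        page.add(new Chunk("Right column, line 2.", 300, 120));
        return page;
    }

    /** Sorting off: keep the stored order, so each column stays contiguous. */
    static String inStreamOrder(List<Chunk> chunks) {
        return join(chunks);
    }

    /** Sorting on: order by row (y), then left to right (x), as the page looks. */
    static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> c.y)
                .thenComparingInt(c -> c.x));
        return join(sorted);
    }

    private static String join(List<Chunk> chunks) {
        List<String> texts = new ArrayList<>();
        for (Chunk c : chunks) {
            texts.add(c.text);
        }
        return String.join(" ", texts);
    }

    public static void main(String[] args) {
        List<Chunk> page = twoColumnPage();
        // Columns stay intact, which keeps sentences together for indexing.
        System.out.println(inStreamOrder(page));
        // Rows are read straight across, so the two columns get interleaved.
        System.out.println(sortedByPosition(page));
    }
}
```

With sorting off, the text of each column stays together, which is what a search index wants. With sorting on, the output follows the visual rows of the page, which looks closer to a PDF viewer but interleaves the two columns.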