[ https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865143#comment-17865143 ]
Tilman Hausherr edited comment on TIKA-4277 at 7/11/24 7:00 PM:
----------------------------------------------------------------

You should add / integrate something like this:
{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="DetectAngles" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
{code}

was (Author: tilman):
    You should add / integrate something like this:
{code:xml}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="DetectAngles" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
{code}

> PDF parse issue for text rotated
> --------------------------------
>
>                 Key: TIKA-4277
>                 URL: https://issues.apache.org/jira/browse/TIKA-4277
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-app, tika-server
>    Affects Versions: 3.0.0-BETA, 2.9.2
>            Reporter: ragebear
>            Priority: Major
>         Attachments: OtherPDFReader.png, sample2.pdf
>
>
> Incorrect result parsed by Tika and Tika Server 2.9.2 and 3.0beta.
> The attached PDF cannot be parsed correctly by Tika 2.9.2 and 3.0beta, in
> either the server version or the standalone version.
> If the text is rotated 90 degrees, the parsed result has a line break after
> each letter of a word. This happens with symbols, English letters, and CJK
> characters.
> In the server version: curl -g -T "sample2.pdf" [http://localhost:889/tika]
> --header "Accept: text/plain"
> In the standalone version: java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar"
> --text
> Both of the above deliver the incorrect result for the attached PDF.
> The output result is below
> i
> n
> s
> e
> r
> t
>
> t
> e
> x
> t
>
> p
> r
> o
> b
> l
> e
> m
> insert text problem



--
This message was sent by Atlassian Jira
(v8.20.10#820010)