[jira] [Comment Edited] (TIKA-4277) PDF parse issue for text rotated

Tilman Hausherr (Jira) Thu, 11 Jul 2024 12:01:16 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865143#comment-17865143
 ]


Tilman Hausherr edited comment on TIKA-4277 at 7/11/24 7:00 PM:
----------------------------------------------------------------

You should add / integrate something like this:
{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="DetectAngles" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>
{code}
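
A rough sketch of how a config like the one above could be applied, assuming it is saved as tika-config.xml in the working directory. The file name, the jar names, and the default server port 9998 are assumptions, so adjust them to your setup:
{code:bash}
# Assumption: the XML above was saved as tika-config.xml next to the jars.

# Standalone app: pass --config before --text.
java -jar tika-app-2.9.2.jar --config=tika-config.xml --text sample2.pdf

# Server: start it with the same config file ...
java -jar tika-server-standard-2.9.2.jar --config tika-config.xml

# ... then, from a second shell, send the PDF
# (9998 is the server's default port; use your own if it differs).
curl -T sample2.pdf http://localhost:9998/tika --header "Accept: text/plain"
{code}
If the config is picked up, the rotated text should come back as whole lines rather than one letter per line.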


was (Author: tilman):
You should add / integrate something like this:
{code:xml}
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="DetectAngles" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>
{code}

> PDF parse issue for text rotated
> --------------------------------
>
>                 Key: TIKA-4277
>                 URL: https://issues.apache.org/jira/browse/TIKA-4277
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-app, tika-server
>    Affects Versions: 3.0.0-BETA, 2.9.2
>            Reporter: ragebear
>            Priority: Major
>         Attachments: OtherPDFReader.png, sample2.pdf
>
>
> Tika and Tika Server 2.9.2 and 3.0.0-BETA return an incorrect result.
> The attached PDF cannot be parsed correctly by Tika 2.9.2 or 3.0.0-BETA, in 
> either the server or the standalone version.
> If the text is rotated 90 degrees, the parsed result has a line break after 
> each letter of a word. This happens with symbols, English letters, and CJK 
> characters.
> In the server version, curl -g -T "sample2.pdf" 
> [http://localhost:889/tika]
> --header "Accept: text/plain"
> In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" 
> --text
> Both of the above deliver the incorrect result for the attached PDF.
> The output is shown below:
> i
> n
> s
> e
> r
> t
>  
> t
> e
> x
> t
>  
> p
> r
> o
> b
> l
> e
> m
> insert text problem



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
