[jira] [Commented] (TIKA-723) Rotated text isn't extracted correctly from PDFs

John Mastarone (Commented) (JIRA) Thu, 24 Nov 2011 18:59:14 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156968#comment-13156968
 ]


John Mastarone commented on TIKA-723:
-------------------------------------

With the latest source, I tried adding the line
"if (parser instanceof org.apache.tika.parser.pdf.PDFParser){ 
((org.apache.tika.parser.pdf.PDFParser)parser).setSortByPosition(true);}"
to the parse method of the CompositeParser class, right after the line
"Parser parser = getParser(metadata);". To make this compile, I also had to add
tika-parsers as a dependency of tika-core. After rebuilding the core jar and
tika-app, the rotated text was no longer split one character per line when
using the GUI. None of the other PDFs in the test-resources folder appeared to
be parsed incorrectly, except for the first one (testAnnotations.pdf), which
fails to parse entirely. It also fails to parse with an unmodified, most-recent
version of the Tika GUI, with the same NPE in both cases, so it does not appear
to be caused by my change. I don't know whether there is a JIRA item for this
yet. I also downloaded the PDFBox application jar and ran ExtractText with the
-sort option, which extracted the rotated text in your rotated.pdf file
correctly.
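
To make the shape of that change concrete, here is a minimal, self-contained
sketch of the same instanceof check. The Parser, PdfParser, PlainParser, and
CompositeParser types below are simplified stand-ins invented for this
illustration, not the real Tika classes; only the setSortByPosition name is
taken from the line quoted above.

```java
class SortByPositionSketch {
    /** Hypothetical stand-in for a parser interface (not Tika's Parser). */
    interface Parser {
        String parse(String input);
    }

    /** Hypothetical stand-in for a PDF parser with a sort-by-position switch. */
    static class PdfParser implements Parser {
        private boolean sortByPosition = false;

        void setSortByPosition(boolean value) {
            this.sortByPosition = value;
        }

        boolean getSortByPosition() {
            return sortByPosition;
        }

        @Override
        public String parse(String input) {
            // A real parser would extract text; this one only reports its mode.
            return (sortByPosition ? "sorted:" : "stream-order:") + input;
        }
    }

    /** Hypothetical stand-in for any non-PDF parser. */
    static class PlainParser implements Parser {
        @Override
        public String parse(String input) {
            return "plain:" + input;
        }
    }

    /** Hypothetical composite that delegates to one chosen parser. */
    static class CompositeParser implements Parser {
        private final Parser delegate;

        CompositeParser(Parser delegate) {
            this.delegate = delegate;
        }

        @Override
        public String parse(String input) {
            Parser parser = delegate; // stands in for getParser(metadata)
            // Same pattern as the quoted line: only PDF parsers get the flag.
            if (parser instanceof PdfParser) {
                ((PdfParser) parser).setSortByPosition(true);
            }
            return parser.parse(input);
        }
    }

    public static void main(String[] args) {
        System.out.println(new CompositeParser(new PdfParser()).parse("doc"));
        System.out.println(new CompositeParser(new PlainParser()).parse("doc"));
    }
}
```

In this sketch everything lives in one file. In a build where the composite and
the PDF parser live in separate modules, the check forces the composite's
module to depend on the PDF parser's module, which is the extra dependency
mentioned above.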

After making this change to CompositeParser, two test cases failed in
tika-parsers, at lines 147 and 180 of PDFParserTest.java, which concern
testPDFTwoTextBoxes.pdf and a table in testPDFVarious.pdf.  However, the
assertions made on these lines are arguably open to interpretation: should the
Tika PDF parser really print all of the items in a column before moving on to
the next column?  With my change, all elements of a given row are printed
before moving on to the next row (row-major order instead of column-major
order).  This could be fine for the table in testPDFVarious.pdf, but maybe
less so for the two text boxes in the other PDF?
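
As an illustration of the two orderings, here is a small toy model. It is not
how Tika or PDFBox sorts text internally; the Chunk class, the coordinates,
and the assumption that y grows downward are all made up for this sketch. Each
chunk of text has a position, and the same four chunks are emitted either row
by row or column by column.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReadingOrderSketch {
    /** Hypothetical text chunk with the position where it is drawn. */
    static final class Chunk {
        final String text;
        final int x; // horizontal position, grows to the right
        final int y; // vertical position, assumed to grow downward

        Chunk(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Row-major: finish each row (same y) before moving down. */
    static List<String> rowMajor(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> c.y)
                .thenComparingInt(c -> c.x));
        return texts(sorted);
    }

    /** Column-major: finish each column (same x) before moving right. */
    static List<String> columnMajor(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> c.x)
                .thenComparingInt(c -> c.y));
        return texts(sorted);
    }

    private static List<String> texts(List<Chunk> chunks) {
        List<String> out = new ArrayList<>();
        for (Chunk c : chunks) {
            out.add(c.text);
        }
        return out;
    }

    /** Two columns (A, B) with two rows each. */
    static List<Chunk> twoColumnExample() {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("A1", 0, 0));
        chunks.add(new Chunk("A2", 0, 10));
        chunks.add(new Chunk("B1", 100, 0));
        chunks.add(new Chunk("B2", 100, 10));
        return chunks;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = twoColumnExample();
        System.out.println("row-major:    " + rowMajor(chunks));    // [A1, B1, A2, B2]
        System.out.println("column-major: " + columnMajor(chunks)); // [A1, A2, B1, B2]
    }
}
```

For a table, the row-major result keeps each row's cells together, which is
usually what a reader expects. For two independent text boxes placed side by
side, the same ordering interleaves their lines, which is the concern raised
above.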

I'm not experienced with Tika development, but perhaps a line (or lines) like
the one above belongs somewhere in the code: if not in CompositeParser, then
elsewhere, depending on what you and others think about the test cases that
would fail as a result.
                
> Rotated text isn't extracted correctly from PDFs
> ------------------------------------------------
>
>                 Key: TIKA-723
>                 URL: https://issues.apache.org/jira/browse/TIKA-723
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: rotated.pdf
>
>
> I have an example PDF with 90 degree rotation; Tika produces the
> characters one line at a time.  Ie, the doc has "Some rotated text,
> here!" but Tika produces this:
> {noformat}
> <body><div class="page"><p>So
> m
> e
>  
> r
> o
> t
> a
> t
> e
> d
>  
> t
> e
> x
> t
> ,
>  
> h
> e
> r
> e
> !</p>
> {noformat}
> I'm able to copy/paste the text out correctly.

