Alexander Kazakov created TIKA-2098:
---------------------------------------

             Summary: Tika.parseToString() with maxLength doesn't work 
correctly for PDF files
                 Key: TIKA-2098
                 URL: https://issues.apache.org/jira/browse/TIKA-2098
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.13
            Reporter: Alexander Kazakov


When parsing a PDF file with Tika.parseToString(InputStream stream, Metadata 
metadata, int maxLength) where maxLength is smaller than the content size, 
the call throws a TikaException:

{code:java}
org.apache.tika.exception.TikaException: Unable to extract all PDF content
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:135)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.Tika.parseToString(Tika.java:568)
Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: Tika - Content Analysis Toolkit
        at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:302)
        at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
        at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
        at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
        at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
        at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
        ... 35 more
Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
        at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
        at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:300)
        ... 43 more
Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
        at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        ... 51 more
Caused by: org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
        at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        ... 52 more
{code}
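
The trace above shows the write-limit signal (WriteLimitReachedException) 
wrapped several times before it surfaces as a TikaException. The following is 
a minimal, dependency-free sketch of that pattern and of one possible way a 
caller could recognise it by walking the cause chain. It is not Tika's code and 
not a proposed patch: every class and method name in it (WriteLimitSketch, 
LimitReachedException, ParseException, LimitedBuffer, isLimitReached) is a 
hypothetical stand-in, and the chunked input only imitates text being written 
piece by piece.

```java
import java.io.IOException;

public class WriteLimitSketch {

    /** Hypothetical stand-in for the innermost "limit reached" signal. */
    static class LimitReachedException extends Exception {
        LimitReachedException(String message) { super(message); }
    }

    /** Hypothetical stand-in for the outer exception the parser finally throws. */
    static class ParseException extends Exception {
        ParseException(String message, Throwable cause) { super(message, cause); }
    }

    /** Collects text and throws once a chunk would exceed maxLength characters. */
    static class LimitedBuffer {
        private final StringBuilder text = new StringBuilder();
        private final int maxLength;

        LimitedBuffer(int maxLength) { this.maxLength = maxLength; }

        void write(String chunk) throws LimitReachedException {
            int room = maxLength - text.length();
            if (chunk.length() > room) {
                text.append(chunk, 0, room); // keep the text up to the limit
                throw new LimitReachedException(
                        "limit of " + maxLength + " characters reached");
            }
            text.append(chunk);
        }

        String text() { return text.toString(); }
    }

    /** Imitates a parser that wraps the buffer's exception twice on the way out. */
    static void parse(String[] chunks, LimitedBuffer out) throws ParseException {
        try {
            for (String chunk : chunks) {
                try {
                    out.write(chunk);
                } catch (LimitReachedException e) {
                    throw new IOException("Unable to write a string: " + chunk, e);
                }
            }
        } catch (IOException e) {
            throw new ParseException("Unable to extract all content", e);
        }
    }

    /** Returns true if any exception in the cause chain is the limit signal. */
    static boolean isLimitReached(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof LimitReachedException) {
                return true;
            }
        }
        return false;
    }

    /** Returns at most maxLength characters instead of failing on long input. */
    static String parseToString(String[] chunks, int maxLength) throws ParseException {
        LimitedBuffer out = new LimitedBuffer(maxLength);
        try {
            parse(chunks, out);
        } catch (ParseException e) {
            if (!isLimitReached(e)) {
                throw e; // a genuine failure, not the write limit
            }
        }
        return out.text();
    }

    public static void main(String[] args) throws ParseException {
        String[] chunks = {"Tika - ", "Content Analysis ", "Toolkit"};
        System.out.println(parseToString(chunks, 10));
    }
}
```

In this sketch the check inspects the whole cause chain rather than only the 
outermost exception, so the limit is still recognised after intermediate layers 
have re-wrapped it.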


