[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523870#comment-15523870 ]

ASF GitHub Bot commented on TIKA-2098:
--------------------------------------

GitHub user alexshadow007 opened a pull request:

    https://github.com/apache/tika/pull/134

    fix for TIKA-2098 contributed by alexshadow007

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/alexshadow007/tika TIKA-2098

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/134.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #134
    
----
commit c33ac04618f97c06fe4508b5d41465b2c11ba1b9
Author: Alexander Kazakov <alexshadow...@gmail.com>
Date:   2016-09-26T18:48:11Z

    fix for TIKA-2098 contributed by alexshadow007

----


> Tika.parseToString() with maxLength doesn't work correctly for PDF files
> ------------------------------------------------------------------------
>
>                 Key: TIKA-2098
>                 URL: https://issues.apache.org/jira/browse/TIKA-2098
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Alexander Kazakov
>              Labels: java, parser, pdf
>
> When parsing a PDF file with Tika.parseToString(InputStream stream, Metadata
> metadata, int maxLength) where maxLength is smaller than the content size,
> it throws an exception:
> {code:java}
> org.apache.tika.exception.TikaException: Unable to extract all PDF content
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:135)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.tika.Tika.parseToString(Tika.java:568)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: Tika - Content Analysis Toolkit
>       at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:302)
>       at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
>       at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
>       at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
>       at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
>       at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
>       at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>       at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
>       ... 35 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
>       at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>       at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>       at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>       at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
>       at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
>       at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
>       at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:300)
>       ... 43 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
>       at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       ... 51 more
> Caused by: org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
>       at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       ... 52 more
> {code}
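
The trace above suggests that the write-limit signal is raised at the bottom of the handler chain and then wrapped twice on the way up (once in an IOException-style wrapper, once in a TikaException), so the caller no longer recognizes it as the expected "limit reached" condition. The following is only a stdlib-only sketch of that failure mode and one possible way to cope with it by walking the cause chain. Every class and method name in it (WriteLimitSketch, LimitReachedException, ParseException, LimitedSink, parsePages, causedByLimit) is a hypothetical stand-in, not a Tika or PDFBox API, and it is not necessarily how the pull request fixes the issue.

```java
// Hypothetical stand-ins for illustration only; none of these are Tika classes.
final class WriteLimitSketch {

    /** Marker exception raised once the character limit has been hit. */
    static final class LimitReachedException extends RuntimeException {
        LimitReachedException(int limit) {
            super("Document contained more than " + limit + " characters");
        }
    }

    /** Generic wrapper, standing in for the wrapping exceptions in the trace. */
    static final class ParseException extends RuntimeException {
        ParseException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Collects text up to maxLength characters (assumed >= 0), then signals the limit. */
    static final class LimitedSink {
        private final StringBuilder text = new StringBuilder();
        private final int maxLength;

        LimitedSink(int maxLength) {
            this.maxLength = maxLength;
        }

        void write(String s) {
            int room = maxLength - text.length();
            if (s.length() > room) {
                text.append(s, 0, room); // keep the part that still fits
                throw new LimitReachedException(maxLength);
            }
            text.append(s);
        }

        String text() {
            return text.toString();
        }
    }

    /** Inner "parser": buries the limit signal as the cause of another exception. */
    static void parsePages(String[] pages, LimitedSink sink) {
        for (String page : pages) {
            try {
                sink.write(page);
            } catch (LimitReachedException e) {
                throw new ParseException("Unable to write a string: " + page, e);
            }
        }
    }

    /** Returns true if the limit marker appears anywhere in the cause chain. */
    static boolean causedByLimit(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof LimitReachedException) {
                return true;
            }
        }
        return false;
    }

    /** Caller: treats a wrapped limit signal as normal termination, rethrows anything else. */
    static String parseToString(String[] pages, int maxLength) {
        LimitedSink sink = new LimitedSink(maxLength);
        try {
            parsePages(pages, sink);
        } catch (ParseException e) {
            if (!causedByLimit(e)) {
                throw e;
            }
            // Limit reached: the text collected so far is the expected result.
        }
        return sink.text();
    }

    public static void main(String[] args) {
        String[] pages = {"Tika - Content ", "Analysis Toolkit"};
        System.out.println(parseToString(pages, 10)); // prints "Tika - Con"
    }
}
```

Checking only the outermost exception type would miss the wrapped marker in this sketch, which is why the caller inspects the whole cause chain before deciding whether to rethrow.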



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
