[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523870#comment-15523870 ]
ASF GitHub Bot commented on TIKA-2098:
--------------------------------------

GitHub user alexshadow007 opened a pull request:

    https://github.com/apache/tika/pull/134

    fix for TIKA-2098 contributed by alexshadow007

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/alexshadow007/tika TIKA-2098

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/134.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #134

----
commit c33ac04618f97c06fe4508b5d41465b2c11ba1b9
Author: Alexander Kazakov <alexshadow...@gmail.com>
Date:   2016-09-26T18:48:11Z

    fix for TIKA-2098 contributed by alexshadow007

----

> Tika.parseToString() with maxLength doesn't work correctly for PDF files
> ------------------------------------------------------------------------
>
>                 Key: TIKA-2098
>                 URL: https://issues.apache.org/jira/browse/TIKA-2098
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Alexander Kazakov
>              Labels: java, parser, pdf
>
> When parsing PDF file with Tika.parseToString(InputStream stream, Metadata
> metadata, int maxLength) and maxLength < content size it throws Exception.
> {code:java}
> org.apache.tika.exception.TikaException: Unable to extract all PDF content
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:135)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.Tika.parseToString(Tika.java:568)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: Tika - Content Analysis Toolkit
>     at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:302)
>     at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
>     at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
>     at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
>     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
>     ... 35 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
>     at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>     at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>     at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>     at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
>     at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
>     at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
>     at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:300)
>     ... 43 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
>     at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     ... 51 more
> Caused by: org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
>     at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     ... 52 more
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
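
In the trace above, the write-limit exception sits several "Caused by" levels below the TikaException that reaches the caller. The following is a minimal sketch of how a caller could walk the cause chain to tell a limit signal apart from a genuine parse failure. It is not the change made in the pull request, and it uses no Tika classes: LimitReachedException and CauseChainDemo are hypothetical stand-ins invented for illustration.

```java
import java.io.IOException;

/** Hypothetical stand-in for a "write limit reached" signal; not a Tika class. */
class LimitReachedException extends Exception {
    LimitReachedException(String message) {
        super(message);
    }
}

/** Hypothetical helper that searches an exception's cause chain. */
final class CauseChainDemo {
    // Upper bound on the walk, so a malformed cyclic chain cannot loop forever.
    private static final int MAX_DEPTH = 64;

    private CauseChainDemo() {
    }

    /** Returns true if t, or any exception in its cause chain, is an instance of type. */
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        int depth = 0;
        for (Throwable c = t; c != null && depth < MAX_DEPTH; c = c.getCause(), depth++) {
            if (type.isInstance(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Mimic the layering in the trace: limit signal -> I/O wrapper -> outer failure.
        Exception limit = new LimitReachedException("requested limit has been reached");
        Exception io = new IOException("Unable to write a string", limit);
        Exception outer = new RuntimeException("Unable to extract all PDF content", io);

        System.out.println(hasCause(outer, LimitReachedException.class));
        System.out.println(hasCause(new IOException("disk error"), LimitReachedException.class));
    }
}
```

With a check of this kind, a caller could treat the wrapped limit signal as "text up to the limit is available" instead of surfacing it as an extraction failure. Whether that belongs in the caller or in the code that does the wrapping is a design choice this sketch does not settle.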