[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files
[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15642937#comment-15642937 ]

Hudson commented on TIKA-2098:
------------------------------

FAILURE: Integrated in Jenkins build tika-2.x-windows #69 (See [https://builds.apache.org/job/tika-2.x-windows/69/])
improve unit test for TIKA-2098 (tallison: rev 6ca74bec6a1d448bbe3340d51dc84ca8ca58507a)
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java

> Tika.parseToString() with maxLength doesn't work correctly for PDF files
>
>
>                 Key: TIKA-2098
>                 URL: https://issues.apache.org/jira/browse/TIKA-2098
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Alexander Kazakov
>            Assignee: Tim Allison
>              Labels: java, parser, pdf
>             Fix For: 2.0, 1.14
>
>
> When parsing PDF file with Tika.parseToString(InputStream stream, Metadata
> metadata, int maxLength) and maxLength < content size it throws Exception.
> {code:java}
> org.apache.tika.exception.TikaException: Unable to extract all PDF content
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:135)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.Tika.parseToString(Tika.java:568)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: Tika - Content Analysis Toolkit
>     at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:302)
>     at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
>     at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
>     at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
>     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
>     ... 35 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
>     at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>     at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>     at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>     at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
>     at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
>     at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
>     at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:300)
>     ... 43 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the
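The trace above shows the write-limit exception nested several levels below the TikaException that finally reaches the caller. One plausible reading is that a check applied only to the outermost exception never sees it. The following is a minimal, stdlib-only sketch of detecting such a nested marker by walking `getCause()`. The names `LimitReachedException` and `CauseChain` are hypothetical stand-ins chosen for illustration; they are not Tika classes, and this is not presented as the actual fix.

```java
import java.io.IOException;

/** Hypothetical marker standing in for a "write limit reached" signal. */
class LimitReachedException extends RuntimeException {
    LimitReachedException(String message) {
        super(message);
    }
}

class CauseChain {
    /** Returns true if t, or any exception in its cause chain, is the marker. */
    static boolean containsLimitReached(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 64; depth++) {
            if (t instanceof LimitReachedException) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable limit = new LimitReachedException("more than 100 characters");
        // Wrap twice, loosely mirroring the nesting visible in the trace.
        Throwable wrapped = new Exception("Unable to extract all PDF content",
                new IOException("Unable to write a string", limit));
        System.out.println(containsLimitReached(wrapped));                      // true
        System.out.println(containsLimitReached(new IOException("disk full"))); // false
    }
}
```

A caller could use a check of this kind to decide whether an exception means "stop, the text up to the limit is available" or a genuine parse failure.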
[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files
[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626453#comment-15626453 ]

Hudson commented on TIKA-2098:
------------------------------

FAILURE: Integrated in Jenkins build tika-2.x #169 (See [https://builds.apache.org/job/tika-2.x/169/])
improve unit test for TIKA-2098 (tallison: rev 6ca74bec6a1d448bbe3340d51dc84ca8ca58507a)
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files
[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626388#comment-15626388 ]

Hudson commented on TIKA-2098:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1131 (See [https://builds.apache.org/job/Tika-trunk/1131/])
improve test for TIKA-2098 (tallison: rev 2df68c84b043f3158c0bdfa63d1a0c8d44d7e18a)
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files
[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524149#comment-15524149 ]

Hudson commented on TIKA-2098:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk # (See [https://builds.apache.org/job/Tika-trunk//])
TIKA-2098 small clean up. Test for writelimitreached for each catchable (tallison: rev 9b497d1fef2fe183b2099f1a835113dade8a0227)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
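The commit message above mentions testing for the write limit at each place an exception is caught. As a rough illustration of that idea, the sketch below checks for a limit signal at the catch site before wrapping, so that reaching the limit ends extraction normally and the text up to the limit is returned. Everything here (`WriteLimitSignal`, `BoundedSink`, `TruncatingExtractor`) is a hypothetical, stdlib-only stand-in written for this example. It does not reproduce the code in PDF2XHTML.java or AbstractPDF2XHTML.java.

```java
import java.io.IOException;

/** Hypothetical signal raised once the character budget is used up. */
class WriteLimitSignal extends IOException {
    WriteLimitSignal(int limit) {
        super("document contained more than " + limit + " characters");
    }
}

/** Hypothetical sink that keeps at most maxLength characters. */
class BoundedSink {
    private final StringBuilder text = new StringBuilder();
    private final int maxLength;

    BoundedSink(int maxLength) {
        this.maxLength = maxLength;
    }

    void write(String chunk) throws IOException {
        int room = maxLength - text.length();
        if (chunk.length() > room) {
            text.append(chunk, 0, room); // keep the text up to the limit
            throw new WriteLimitSignal(maxLength);
        }
        text.append(chunk);
    }

    String text() {
        return text.toString();
    }
}

class TruncatingExtractor {
    /** Feeds chunks to the sink; a limit signal ends extraction without an error. */
    static String extract(String[] chunks, int maxLength) throws IOException {
        BoundedSink sink = new BoundedSink(maxLength);
        try {
            for (String chunk : chunks) {
                try {
                    sink.write(chunk);
                } catch (IOException e) {
                    // Check for the limit at this catch site before wrapping.
                    if (e instanceof WriteLimitSignal) {
                        throw e;
                    }
                    throw new IOException("Unable to write a string: " + chunk, e);
                }
            }
        } catch (WriteLimitSignal reached) {
            // Expected when maxLength < content size: stop and keep what was written.
        }
        return sink.text();
    }

    public static void main(String[] args) throws IOException {
        String[] chunks = {"Tika - ", "Content Analysis ", "Toolkit"};
        System.out.println(extract(chunks, 10)); // prints "Tika - Con"
    }
}
```

Any other IOException is still wrapped and propagated, so genuine write failures remain visible to the caller.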
[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files
[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524143#comment-15524143 ]

Hudson commented on TIKA-2098:
------------------------------

SUCCESS: Integrated in Jenkins build tika-2.x #153 (See [https://builds.apache.org/job/tika-2.x/153/])
TIKA-2098 small clean up. Test for writelimitreached for each catchable (tallison: rev cde4c0aa8b668e0964f2b83fab67588292ffc993)
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files
[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524060#comment-15524060 ]

Hudson commented on TIKA-2098:
------------------------------

FAILURE: Integrated in Jenkins build tika-2.x-windows #57 (See [https://builds.apache.org/job/tika-2.x-windows/57/])
TIKA-2098 small clean up. Test for writelimitreached for each catchable (tallison: rev cde4c0aa8b668e0964f2b83fab67588292ffc993)
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files
[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523958#comment-15523958 ]

Hudson commented on TIKA-2098:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1110 (See [https://builds.apache.org/job/Tika-trunk/1110/])
fix for TIKA-2098 contributed by alexshadow007 (alexshadow007: rev c33ac04618f97c06fe4508b5d41465b2c11ba1b9)
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files
[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523900#comment-15523900 ]

ASF GitHub Bot commented on TIKA-2098:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/134

[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files
[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523870#comment-15523870 ]

ASF GitHub Bot commented on TIKA-2098:
--------------------------------------

GitHub user alexshadow007 opened a pull request:

    https://github.com/apache/tika/pull/134

    fix for TIKA-2098 contributed by alexshadow007

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/alexshadow007/tika TIKA-2098

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/134.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #134

commit c33ac04618f97c06fe4508b5d41465b2c11ba1b9
Author: Alexander Kazakov
Date:   2016-09-26T18:48:11Z

    fix for TIKA-2098 contributed by alexshadow007
