[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files

2016-11-06 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15642937#comment-15642937 ]

Hudson commented on TIKA-2098:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #69 (See [https://builds.apache.org/job/tika-2.x-windows/69/])
improve unit test for TIKA-2098 (tallison: rev 6ca74bec6a1d448bbe3340d51dc84ca8ca58507a)
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java


> Tika.parseToString() with maxLength doesn't work correctly for PDF files
>
> Key: TIKA-2098
> URL: https://issues.apache.org/jira/browse/TIKA-2098
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Reporter: Alexander Kazakov
> Assignee: Tim Allison
> Labels: java, parser, pdf
> Fix For: 2.0, 1.14
>
> When parsing a PDF file with Tika.parseToString(InputStream stream, Metadata
> metadata, int maxLength) where maxLength < content size, an exception is thrown.
> {code:java}
> org.apache.tika.exception.TikaException: Unable to extract all PDF content
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:135)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:568)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: Tika - Content Analysis Toolkit
>   at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:302)
>   at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
>   at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
>   at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
>   at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
>   at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
>   at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>   at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
>   ... 35 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
>   at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>   at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>   at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>   at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>   at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
>   at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
>   at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
>   at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:300)
>   ... 43 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).

[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files

2016-11-01 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626453#comment-15626453 ]

Hudson commented on TIKA-2098:
--

FAILURE: Integrated in Jenkins build tika-2.x #169 (See [https://builds.apache.org/job/tika-2.x/169/])
improve unit test for TIKA-2098 (tallison: rev 6ca74bec6a1d448bbe3340d51dc84ca8ca58507a)
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java



[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files

2016-11-01 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626388#comment-15626388 ]

Hudson commented on TIKA-2098:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1131 (See [https://builds.apache.org/job/Tika-trunk/1131/])
improve test for TIKA-2098 (tallison: rev 2df68c84b043f3158c0bdfa63d1a0c8d44d7e18a)
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java



[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files

2016-09-26 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524149#comment-15524149 ]

Hudson commented on TIKA-2098:
--

SUCCESS: Integrated in Jenkins build Tika-trunk # (See [https://builds.apache.org/job/Tika-trunk//])
TIKA-2098 small clean up.  Test for writelimitreached for each catchable (tallison: rev 9b497d1fef2fe183b2099f1a835113dade8a0227)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
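
The commit message above mentions testing for the write limit on each caught exception. As a rough illustration of that idea only (not the actual Tika change), the sketch below walks an exception's cause chain to decide whether a failure was really just the maxLength limit being hit. Every class and method name here (WriteLimitSketch, LimitedSink, LimitReachedException, isLimitReached, extract) is hypothetical and stands in for Tika's real classes, which are not reproduced.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-ins for illustration; these are NOT Tika's classes.
final class WriteLimitSketch {

    /** Marker thrown by the sink once the character budget is used up. */
    static final class LimitReachedException extends RuntimeException {
        LimitReachedException(int limit) {
            super("document contained more than " + limit + " characters");
        }
    }

    /** Collects text until maxLength characters have been written. */
    static final class LimitedSink {
        private final StringBuilder text = new StringBuilder();
        private final int maxLength;

        LimitedSink(int maxLength) {
            this.maxLength = maxLength;
        }

        void write(String s) {
            int room = maxLength - text.length();
            if (s.length() > room) {
                text.append(s, 0, room); // keep the text up to the limit
                throw new LimitReachedException(maxLength);
            }
            text.append(s);
        }

        String text() {
            return text.toString();
        }
    }

    /** Walks the cause chain looking for the limit marker. */
    static boolean isLimitReached(Throwable t) {
        // The depth bound guards against a cyclic cause chain.
        for (int depth = 0; t != null && depth < 100; depth++, t = t.getCause()) {
            if (t instanceof LimitReachedException) {
                return true;
            }
        }
        return false;
    }

    /** Lower layer that wraps any failure, hiding the original exception type. */
    static void writeString(LimitedSink sink, String s) {
        try {
            sink.write(s);
        } catch (RuntimeException e) {
            throw new UncheckedIOException(
                    new IOException("Unable to write a string: " + s, e));
        }
    }

    /** Returns at most maxLength characters instead of failing on the limit. */
    static String extract(String[] chunks, int maxLength) {
        LimitedSink sink = new LimitedSink(maxLength);
        try {
            for (String chunk : chunks) {
                writeString(sink, chunk);
            }
        } catch (UncheckedIOException e) {
            if (!isLimitReached(e)) {
                throw e; // a genuine failure: propagate it
            }
        }
        return sink.text();
    }

    public static void main(String[] args) {
        String[] chunks = {"Tika - ", "Content Analysis ", "Toolkit"};
        System.out.println(extract(chunks, 10)); // prints "Tika - Con"
    }
}
```

The point of checking the whole chain rather than only the outermost exception is that intermediate layers may wrap the limit exception, as the IOExceptionWithCause frames in the reported stack trace suggest.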



[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files

2016-09-26 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524143#comment-15524143 ]

Hudson commented on TIKA-2098:
--

SUCCESS: Integrated in Jenkins build tika-2.x #153 (See [https://builds.apache.org/job/tika-2.x/153/])
TIKA-2098 small clean up.  Test for writelimitreached for each catchable (tallison: rev cde4c0aa8b668e0964f2b83fab67588292ffc993)
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java



[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files

2016-09-26 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524060#comment-15524060 ]

Hudson commented on TIKA-2098:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #57 (See [https://builds.apache.org/job/tika-2.x-windows/57/])
TIKA-2098 small clean up.  Test for writelimitreached for each catchable (tallison: rev cde4c0aa8b668e0964f2b83fab67588292ffc993)
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java



[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files

2016-09-26 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523958#comment-15523958 ]

Hudson commented on TIKA-2098:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1110 (See [https://builds.apache.org/job/Tika-trunk/1110/])
fix for TIKA-2098 contributed by alexshadow007 (alexshadow007: rev c33ac04618f97c06fe4508b5d41465b2c11ba1b9)
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java


> Tika.parseToString() with maxLength doesn't work correctly for PDF files
> 
>
> Key: TIKA-2098
> URL: https://issues.apache.org/jira/browse/TIKA-2098
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
>Reporter: Alexander Kazakov
>  Labels: java, parser, pdf
>
> When parsing PDF file with Tika.parseToString(InputStream stream, Metadata 
> metadata, int maxLength) and maxLength < content size it throws Exception.
> {code:java}
> org.apache.tika.exception.TikaException: Unable to extract all PDF content
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:135)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:568)
> Caused by: org.apache.commons.io.IOExceptionWithCause: Unable to write a string: Tika - Content Analysis Toolkit
>   at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:302)
>   at org.apache.pdfbox.text.PDFTextStripper.writeString(PDFTextStripper.java:779)
>   at org.apache.pdfbox.text.PDFTextStripper.writeLine(PDFTextStripper.java:1738)
>   at org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:672)
>   at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
>   at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
>   at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>   at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
>   ... 35 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
>   at org.apache.tika.sax.TaggedContentHandler.handleException(TaggedContentHandler.java:113)
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:148)
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>   at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>   at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>   at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>   at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
>   at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
>   at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
>   at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:300)
>   ... 43 more
> Caused by: org.apache.tika.sax.TaggedSAXException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
> org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however 
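
Editorial illustration, not part of the original thread: the trace above shows the write-limit exception nested several levels below the top-level TikaException (TikaException, then IOExceptionWithCause, then TaggedSAXException / WriteLimitReachedException). A caller that wants to tell "output was truncated at maxLength" apart from a genuine parse failure could walk the cause chain, roughly as in the sketch below. This is a sketch under assumptions: `CauseChainSketch`, `WriteLimitReached`, and `findCause` are hypothetical names used only for illustration; they are not Tika API, and this is not claimed to be the fix committed for TIKA-2098.

```java
import java.io.IOException;

public class CauseChainSketch {

    // Hypothetical stand-in for a write-limit exception; NOT the real Tika class.
    static class WriteLimitReached extends Exception {
        WriteLimitReached(String message) {
            super(message);
        }
    }

    /**
     * Returns the first throwable of the requested type found in the cause
     * chain of t (including t itself), or null if there is none.
     * The depth guard protects against pathological cyclic cause chains.
     */
    static <T extends Throwable> T findCause(Throwable t, Class<T> type) {
        int depth = 0;
        while (t != null && depth++ < 64) {
            if (type.isInstance(t)) {
                return type.cast(t);
            }
            t = t.getCause();
        }
        return null;
    }

    public static void main(String[] args) {
        // Rebuild a chain shaped like the one in the trace, using stand-in types.
        Exception limit = new WriteLimitReached("requested limit of 100 characters reached");
        Exception io = new IOException("Unable to write a string", limit);
        Exception top = new Exception("Unable to extract all PDF content", io);

        WriteLimitReached found = findCause(top, WriteLimitReached.class);
        if (found != null) {
            System.out.println("truncated: " + found.getMessage());
        } else {
            System.out.println("real failure: " + top.getMessage());
        }
    }
}
```

Matching on the exception type rather than on the message text keeps the check independent of the wording of the limit message.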

[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523900#comment-15523900
 ] 

ASF GitHub Bot commented on TIKA-2098:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/134


> Tika.parseToString() with maxLength doesn't work correctly for PDF files
>
> Key: TIKA-2098
> URL: https://issues.apache.org/jira/browse/TIKA-2098
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Reporter: Alexander Kazakov
> Labels: java, parser, pdf
>
> When parsing a PDF file with Tika.parseToString(InputStream stream, Metadata 
> metadata, int maxLength) and maxLength < content size, it throws an exception.

[jira] [Commented] (TIKA-2098) Tika.parseToString() with maxLength doesn't work correctly for PDF files

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523870#comment-15523870
 ] 

ASF GitHub Bot commented on TIKA-2098:
--

GitHub user alexshadow007 opened a pull request:

https://github.com/apache/tika/pull/134

fix for TIKA-2098 contributed by alexshadow007



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/alexshadow007/tika TIKA-2098

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/134.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #134


commit c33ac04618f97c06fe4508b5d41465b2c11ba1b9
Author: Alexander Kazakov 
Date:   2016-09-26T18:48:11Z

fix for TIKA-2098 contributed by alexshadow007




> Tika.parseToString() with maxLength doesn't work correctly for PDF files
>
> Key: TIKA-2098
> URL: https://issues.apache.org/jira/browse/TIKA-2098
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Reporter: Alexander Kazakov
> Labels: java, parser, pdf
>
> When parsing a PDF file with Tika.parseToString(InputStream stream, Metadata 
> metadata, int maxLength) and maxLength < content size, it throws an exception.