[jira] [Commented] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.
[ https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946387#comment-16946387 ]

Luke Butters commented on TIKA-2955:
------------------------------------

Hi, I made this PR: https://github.com/apache/tika/pull/285

Is that how you want it?

> PDF parsing to XHTML results in tika attempting to write invalid HTML characters.
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-2955
>                 URL: https://issues.apache.org/jira/browse/TIKA-2955
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Luke Butters
>            Priority: Major
>         Attachments: 314.pdf, fix_with_tests.txt
>
>
> Hi, I am trying to parse: [^314.pdf]
> When I try to convert it to XHTML, my XML parser fails because:
> {code}
> 14:35:12.876 [main] ERROR com.funnelback.common.filter.TikaFilterProvider - Unable to filter stream with document type '.pdf'
> org.xml.sax.SAXException: net.sf.saxon.trans.XPathException: Illegal HTML character: decimal 147
>     at net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:538) ~[Saxon-HE-9.9.0-2.jar:?]
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>     at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256) ~[tika-core-1.19.1.jar:1.19.1]
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>     at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:274) ~[tika-core-1.19.1.jar:1.19.1]
>     at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:229) ~[tika-core-1.19.1.jar:1.19.1]
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:556) ~[tika-parsers-1.19.1.jar:1.19.1]
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:267) ~[pdfbox-2.0.12.jar:2.0.12]
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117) ~[tika-parsers-1.19.1.jar:1.19.1]
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172) ~[tika-parsers-1.19.1.jar:1.19.1]
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.1.jar:1.19.1]
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.1.jar:1.19.1]
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.19.1.jar:1.19.1]
>     at [removed section of trace]
> Caused by: net.sf.saxon.trans.XPathException: Illegal HTML character: decimal 147
>     at net.sf.saxon.serialize.HTMLEmitter.writeEscape(HTMLEmitter.java:379) ~[Saxon-HE-9.9.0-2.jar:?]
>     at net.sf.saxon.serialize.XMLEmitter.characters(XMLEmitter.java:662) ~[Saxon-HE-9.9.0-2.jar:?]
>     at net.sf.saxon.serialize.HTMLEmitter.characters(HTMLEmitter.java:441) ~[Saxon-HE-9.9.0-2.jar:?]
>     at net.sf.saxon.serialize.HTMLIndenter.characters(HTMLIndenter.java:216) ~[Saxon-HE-9.9.0-2.jar:?]
>     at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) ~[Saxon-HE-9.9.0-2.jar:?]
>     at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) ~[Saxon-HE-9.9.0-2.jar:?]
>     at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) ~[Saxon-HE-9.9.0-2.jar:?]
>     at net.sf.saxon.event.SequenceNormalizer.characters(SequenceNormalizer.java:183) ~[Saxon-HE-9.9.0-2.jar:?]
>     at net.sf.saxon.event.ReceivingContentHandler.flush(ReceivingContentHandler.java:646) ~[Saxon-HE-9.9.0-2.jar:?]
>     at net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:526) ~[Saxon-HE-9.9.0-2.jar:?]
>     ... 43 more
> {code}
> It looks like Tika is asking the XML library to handle character 147, i.e. 0x93, which is not allowed in HTML.
> The Saxon XML library rejects that character. I think the default Java one does not complain when given the invalid character; however, Tika is probably wrong to write out that character when writing XHTML.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
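For illustration only, here is a minimal sketch of how one might locate the character that a strict HTML serializer rejects in extracted text. The class and method names are hypothetical and belong to neither Tika nor Saxon, and the range U+007F to U+009F is an assumption about which characters count as invalid (decimal 147 is 0x93, which falls inside it).

```java
/** Hypothetical helper; not part of Tika or Saxon. */
final class HtmlControlChars {

    /** Assumed suspect range: DELETE (U+007F) plus the C1 controls (U+0080..U+009F). */
    static boolean isSuspect(int codePoint) {
        return codePoint >= 0x7F && codePoint <= 0x9F;
    }

    /** Returns the index of the first suspect char in the text, or -1 if there is none. */
    static int indexOfFirstSuspect(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (isSuspect(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

Running `HtmlControlChars.indexOfFirstSuspect(...)` over the text extracted from a PDF would be one way to confirm which character triggers the exception before deciding how to handle it.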
[jira] [Commented] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.
[ https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946386#comment-16946386 ]

ASF GitHub Bot commented on TIKA-2955:
--------------------------------------

LukeButters commented on pull request #285: Fix for TIKA-2955 filter out invalid HTML characters 0x7F to 0x9F
URL: https://github.com/apache/tika/pull/285
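As a rough sketch of what "filter out invalid HTML characters 0x7F to 0x9F" could look like at the SAX level, the following uses the JDK's `org.xml.sax.helpers.XMLFilterImpl`. This is not the code from the pull request: the class name is hypothetical, and dropping the characters (rather than replacing them) is an assumption made for the example.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical SAX filter; an assumption-laden sketch, not the Tika patch. */
final class ControlCharFilter extends XMLFilterImpl {

    /** True for U+007F..U+009F, the range named in the pull request title. */
    static boolean isFiltered(char c) {
        return c >= 0x7F && c <= 0x9F;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Copy the surviving characters so the caller's buffer is never modified.
        char[] out = new char[length];
        int n = 0;
        for (int i = start; i < start + length; i++) {
            if (!isFiltered(ch[i])) {
                out[n++] = ch[i];
            }
        }
        // Forward only the surviving characters to the downstream handler.
        super.characters(out, 0, n);
    }
}
```

A filter like this would sit between the parser and a strict serializer via `setContentHandler(...)`; whether dropping or substituting a replacement character is preferable depends on the consumer.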
[jira] [Commented] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.
[ https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946362#comment-16946362 ]

Tim Allison commented on TIKA-2955:
-----------------------------------

If you make the PR against master, I’ll cherry-pick it to branch_1x. I’m happy to take the patch as is.

Thank you for digging into the details!
[jira] [Updated] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.
[ https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luke Butters updated TIKA-2955:
-------------------------------
    Attachment: fix_with_tests.txt
[jira] [Comment Edited] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.
[ https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946270#comment-16946270 ]

Luke Butters edited comment on TIKA-2955 at 10/7/19 9:53 PM:
-------------------------------------------------------------

So [wikipedia Valid_characters_in_XML|https://en.wikipedia.org/wiki/Valid_characters_in_XML] says that for XML 1.0 this range is valid:
{quote}
U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
{quote}
It goes on to say:
{quote}
The preceding code points ranges contain the following controls which are only valid in certain contexts in XML 1.0 documents, and whose usage is restricted and highly discouraged:
U+007F–U+0084, U+0086–U+009F: this includes a C0 control character and all but one C1 control.
{quote}
I think most of that range is allowed in XML, although discouraged.

Going over to https://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#character-references it says:
{quote}
The numeric character reference forms described above are allowed to reference any Unicode code point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters), and control characters other than space characters.
{quote}
I think it is trying to say that it excludes control characters from those encodings.

Looking at: ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
{code}
007F;;Cc;0;BN;N;DELETE
0080;;Cc;0;BN;N;
0081;;Cc;0;BN;N;
0082;;Cc;0;BN;N;BREAK PERMITTED HERE
0083;;Cc;0;BN;N;NO BREAK HERE
0084;;Cc;0;BN;N;
0085;;Cc;0;B;N;NEXT LINE (NEL)
0086;;Cc;0;BN;N;START OF SELECTED AREA
0087;;Cc;0;BN;N;END OF SELECTED AREA
0088;;Cc;0;BN;N;CHARACTER TABULATION SET
0089;;Cc;0;BN;N;CHARACTER TABULATION WITH JUSTIFICATION
008A;;Cc;0;BN;N;LINE TABULATION SET
008B;;Cc;0;BN;N;PARTIAL LINE FORWARD
008C;;Cc;0;BN;N;PARTIAL LINE BACKWARD
008D;;Cc;0;BN;N;REVERSE LINE FEED
008E;;Cc;0;BN;N;SINGLE SHIFT TWO
008F;;Cc;0;BN;N;SINGLE SHIFT THREE
0090;;Cc;0;BN;N;DEVICE CONTROL STRING
0091;;Cc;0;BN;N;PRIVATE USE ONE
0092;;Cc;0;BN;N;PRIVATE USE TWO
0093;;Cc;0;BN;N;SET TRANSMIT STATE
0094;;Cc;0;BN;N;CANCEL CHARACTER
0095;;Cc;0;BN;N;MESSAGE WAITING
0096;;Cc;0;BN;N;START OF GUARDED AREA
0097;;Cc;0;BN;N;END OF GUARDED AREA
0098;;Cc;0;BN;N;START OF STRING
0099;;Cc;0;BN;N;
009A;;Cc;0;BN;N;SINGLE CHARACTER INTRODUCER
009B;;Cc;0;BN;N;CONTROL SEQUENCE INTRODUCER
009C;;Cc;0;BN;N;STRING TERMINATOR
009D;;Cc;0;BN;N;OPERATING SYSTEM COMMAND
009E;;Cc;0;BN;N;PRIVACY MESSAGE
009F;;Cc;0;BN;N;APPLICATION PROGRAM COMMAND
{code}
I then remembered that https://validator.w3.org/nu/#textarea exists and tried out {{}}. The validator did not like that and said:
{code}
Character reference expands to a control character (U+007f).
{code}
So I think it is invalid only in HTML but OK in XML.

Should I be making a pull request on version 2 or on the latest version 1.x branch?

Here are the changes, though: [^fix_with_tests.txt]
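The distinction drawn in this comment (allowed by XML 1.0, yet a control character that HTML character references may not expand to) can be sketched with two predicates. The class and method names are hypothetical; `isXml10Char` follows the XML 1.0 `Char` production, and the control check relies on `Character.getType`, which reports the `Cc` general category shown in the UnicodeData.txt excerpt.

```java
/** Hypothetical classifier; a sketch of the XML-vs-HTML distinction, not Tika code. */
final class CharClasses {

    /** XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True when the code point has Unicode general category Cc (control). */
    static boolean isControl(int cp) {
        return Character.getType(cp) == Character.CONTROL;
    }

    /** Allowed in XML 1.0 but still a control character other than tab, LF, or CR. */
    static boolean xmlOkButControl(int cp) {
        return isXml10Char(cp) && isControl(cp)
                && cp != 0x9 && cp != 0xA && cp != 0xD;
    }
}
```

Under these definitions U+0093 and U+007F are both accepted by `isXml10Char` and flagged by `xmlOkButControl`, which matches the conclusion that the characters are a problem for HTML output rather than for XML itself.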
[jira] [Comment Edited] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.
[ https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946270#comment-16946270 ]

Luke Butters edited comment on TIKA-2955 at 10/7/19 9:41 PM:
-------------------------------------------------------------
[jira] [Commented] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.
[ https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946270#comment-16946270 ]

Luke Butters commented on TIKA-2955:
------------------------------------
[jira] [Comment Edited] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.
[ https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946270#comment-16946270 ] Luke Butters edited comment on TIKA-2955 at 10/7/19 9:40 PM: - So [wikipedia Valid_characters_in_XML|https://en.wikipedia.org/wiki/Valid_characters_in_XML] has this to says for XML 1.0 this range is valid: {quote} U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters in the BMP (all surrogates, U+FFFE and U+ are forbidden); {quote} it goes on to say: {quote} The preceding code points ranges contain the following controls which are only valid in certain contexts in XML 1.0 documents, and whose usage is restricted and highly discouraged: U+007F–U+0084, U+0086–U+009F: this includes a C0 control character and all but one C1 control. {quote} I think most of that range is allowed in XML, although discouraged. Going over to https://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#character-references it says: {quote} The numeric character reference forms described above are allowed to reference any Unicode code point other than U+, U+000D, permanently undefined Unicode characters (noncharacters), and control characters other than space characters. {quote} I think it is trying to say it exclude control characters from those encodings. 
Looking at: ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
{code}
007F;;Cc;0;BN;N;DELETE
0080;;Cc;0;BN;N;
0081;;Cc;0;BN;N;
0082;;Cc;0;BN;N;BREAK PERMITTED HERE
0083;;Cc;0;BN;N;NO BREAK HERE
0084;;Cc;0;BN;N;
0085;;Cc;0;B;N;NEXT LINE (NEL)
0086;;Cc;0;BN;N;START OF SELECTED AREA
0087;;Cc;0;BN;N;END OF SELECTED AREA
0088;;Cc;0;BN;N;CHARACTER TABULATION SET
0089;;Cc;0;BN;N;CHARACTER TABULATION WITH JUSTIFICATION
008A;;Cc;0;BN;N;LINE TABULATION SET
008B;;Cc;0;BN;N;PARTIAL LINE FORWARD
008C;;Cc;0;BN;N;PARTIAL LINE BACKWARD
008D;;Cc;0;BN;N;REVERSE LINE FEED
008E;;Cc;0;BN;N;SINGLE SHIFT TWO
008F;;Cc;0;BN;N;SINGLE SHIFT THREE
0090;;Cc;0;BN;N;DEVICE CONTROL STRING
0091;;Cc;0;BN;N;PRIVATE USE ONE
0092;;Cc;0;BN;N;PRIVATE USE TWO
0093;;Cc;0;BN;N;SET TRANSMIT STATE
0094;;Cc;0;BN;N;CANCEL CHARACTER
0095;;Cc;0;BN;N;MESSAGE WAITING
0096;;Cc;0;BN;N;START OF GUARDED AREA
0097;;Cc;0;BN;N;END OF GUARDED AREA
0098;;Cc;0;BN;N;START OF STRING
0099;;Cc;0;BN;N;
009A;;Cc;0;BN;N;SINGLE CHARACTER INTRODUCER
009B;;Cc;0;BN;N;CONTROL SEQUENCE INTRODUCER
009C;;Cc;0;BN;N;STRING TERMINATOR
009D;;Cc;0;BN;N;OPERATING SYSTEM COMMAND
009E;;Cc;0;BN;N;PRIVACY MESSAGE
009F;;Cc;0;BN;N;APPLICATION PROGRAM COMMAND
{code}
I then remembered that https://validator.w3.org/nu/#textarea exists and tried out {{}}. The validator did not like that and said:
{code}
Character reference expands to a control character (U+007f).
{code}
So I think it is invalid only in HTML but OK in XML.
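To make the distinction above concrete, here is a minimal Java sketch. It is not the code from the PR and not part of Tika: the class {{HtmlSafeChars}} and its methods are hypothetical names, and replacing the offending characters with a space is only an assumed policy for illustration. It treats U+007F–U+009F as the controls that are legal in XML 1.0 but rejected in HTML; decimal 147 from the stack trace is U+0093, which falls in that range.

```java
// Hypothetical helper, not part of Tika or the PR: illustrates the
// "valid in XML 1.0 but a control character in HTML" distinction.
final class HtmlSafeChars {

    private HtmlSafeChars() {}

    /** XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** DELETE plus the C1 controls (U+007F-U+009F): allowed by XML 1.0, flagged as controls by the HTML validator. */
    static boolean isHtmlControl(int cp) {
        return cp >= 0x7F && cp <= 0x9F;
    }

    /** Assumed policy: replace anything invalid in XML 1.0, or a control in HTML, with a single space. */
    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (!isValidXml10(cp) || isHtmlControl(cp)) {
                out.append(' ');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0093 is decimal 147, the character reported in the stack trace.
        System.out.println(sanitize("a\u0093b"));
    }
}
```

Whether such characters should be dropped, replaced, or mapped to something else (for example, treating them as Windows-1252 punctuation) is a separate design decision that this sketch does not settle.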
[jira] [Commented] (TIKA-2941) OSGI bundle and app are not self-contained
[ https://issues.apache.org/jira/browse/TIKA-2941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945889#comment-16945889 ]

Bob Paulin commented on TIKA-2941:
--

Just an update to provide some transparency around the "why" of how we got here. With the newer version of the maven-bundle-plugin, when I revert my earlier commit, I do not see the transitive dependencies included if tika-parsers is in provided scope. With tika-parsers being embedded, it does not really make sense for it to be in provided scope anyway. However, with tika-parsers as a compile-time dependency, all the transitive dependencies are included in maven, which is what is being called out as the issue in this JIRA. The good thing is that from an OSGi perspective we're still OK, since only the following packages are exported:
{code:java}
!org.apache.tika.parser,
!org.apache.tika.parser.external,
org.apache.tika.parser.*,
org.apache.tika.metadata.serialization.*,
{code}
But the maven side still shows all the transitive dependencies coming through. So in an OSGi runtime all these packages are private as expected, but in the development environment this is a bit confusing since maven shows them coming through. I will need some time to see if we can get the maven side of this equation right without breaking the OSGi side. Hopefully this helps provide some context around the problem we're solving.

> OSGI bundle and app are not self-contained
> --
>
> Key: TIKA-2941
> URL: https://issues.apache.org/jira/browse/TIKA-2941
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.22
> Reporter: Peng Cheng
> Priority: Major
>
> Tika bundle still have dependencies spilled out of its package and cause jar hell everywhere.
> If tika bundle is declared in maven as a dependency, a maven dependency:tree will indicate:
> [INFO] | +- org.apache.tika:tika-bundle:jar:1.22:test
> [INFO] | | +- org.apache.tika:tika-core:jar:1.22:test
> [INFO] | | \- org.apache.tika:tika-parsers:jar:1.22:test
> [INFO] | | +- org.glassfish.jaxb:jaxb-runtime:jar:2.3.2:test
> [INFO] | | | +- jakarta.xml.bind:jakarta.xml.bind-api:jar:2.3.2:test
> [INFO] | | | +- org.glassfish.jaxb:txw2:jar:2.3.2:test
> [INFO] | | | +- com.sun.istack:istack-commons-runtime:jar:3.0.8:test
> [INFO] | | | +- org.jvnet.staxex:stax-ex:jar:1.8.1:test
> [INFO] | | | \- com.sun.xml.fastinfoset:FastInfoset:jar:1.2.16:test
> [INFO] | | +- com.sun.activation:jakarta.activation:jar:1.2.1:test
> [INFO] | | +- org.gagravarr:vorbis-java-tika:jar:0.8:test
> [INFO] | | +- org.tallison:jmatio:jar:1.5:test
> [INFO] | | +- org.apache.james:apache-mime4j-core:jar:0.8.3:test
> [INFO] | | +- org.apache.james:apache-mime4j-dom:jar:0.8.3:test
> [INFO] | | +- com.epam:parso:jar:2.0.11:test
> [INFO] | | +- org.brotli:dec:jar:0.1.2:test
> [INFO] | | +- org.apache.pdfbox:pdfbox:jar:2.0.16:test
> [INFO] | | | \- org.apache.pdfbox:fontbox:jar:2.0.16:test
> [INFO] | | +- org.apache.pdfbox:pdfbox-tools:jar:2.0.16:test
> [INFO] | | +- org.apache.pdfbox:jempbox:jar:1.8.16:test
> [INFO] | | +- org.bouncycastle:bcmail-jdk15on:jar:1.62:test
> [INFO] | | | \- org.bouncycastle:bcpkix-jdk15on:jar:1.62:test
> [INFO] | | +- org.bouncycastle:bcprov-jdk15on:jar:1.62:test
> [INFO] | | +- org.apache.poi:poi:jar:4.0.1:test
> [INFO] | | | \- org.apache.commons:commons-collections4:jar:4.2:test
> [INFO] | | +- org.apache.poi:poi-scratchpad:jar:4.0.1:test
> [INFO] | | +- org.apache.poi:poi-ooxml:jar:4.0.1:test
> [INFO] | | | +- org.apache.poi:poi-ooxml-schemas:jar:4.0.1:test
> [INFO] | | | | \- org.apache.xmlbeans:xmlbeans:jar:3.0.2:test
> [INFO] | | | \- com.github.virtuald:curvesapi:jar:1.05:test
> [INFO] | | +- com.healthmarketscience.jackcess:jackcess:jar:3.0.1:test
> [INFO] | | +- com.healthmarketscience.jackcess:jackcess-encrypt:jar:3.0.0:test
> [INFO] | | +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:test
> [INFO] | | +- org.ow2.asm:asm:jar:7.2-beta:test
> [INFO] | | +- com.googlecode.mp4parser:isoparser:jar:1.1.22:test
> [INFO] | | +- com.drewnoakes:metadata-extractor:jar:2.11.0:test
> [INFO] | | | \- com.adobe.xmp:xmpcore:jar:5.1.3:test
> [INFO] | | +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:test
> [INFO] | | +- com.rometools:rome:jar:1.12.1:test
> [INFO] | | | \- com.rometools:rome-utils:jar:1.12.1:test
> [INFO] | | +- org.gagravarr:vorbis-java-core:jar:0.8:test
> [INFO] | | +- org.codelibs:jhighlight:jar:1.0.3:test
> [INFO] | | +- com.pff:java-libpst:jar:0.8.1:test
> [INFO] | | +- com.github.junrar:junrar:jar:4.0.0:test
> [INFO] | | +- org.apache.cxf:cxf-rt-rs-client:jar:3.3.2:test
> [INFO] | | | +- org.apache.cxf:cxf-rt-transports-http:jar:3.3.2:test
> [INFO] | | | +- org.apache.cxf:cxf-core:jar:3.3.2:test
> [INFO] |