[jira] [Commented] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.

2019-10-07 Thread Luke Butters (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946387#comment-16946387
 ] 

Luke Butters commented on TIKA-2955:


Hi, I made this PR:
https://github.com/apache/tika/pull/285

Is that how you want it?


> PDF parsing to XHTML results in tika attempting to write invalid HTML 
> characters.
> -
>
> Key: TIKA-2955
> URL: https://issues.apache.org/jira/browse/TIKA-2955
> Project: Tika
>  Issue Type: Bug
>Reporter: Luke Butters
>Priority: Major
> Attachments: 314.pdf, fix_with_tests.txt
>
>
> Hi, I am trying to parse [^314.pdf].
> When I try to convert it to XHTML, my XML parser fails with:
> {code}
> 14:35:12.876 [main] ERROR com.funnelback.common.filter.TikaFilterProvider - Unable to filter stream with document type '.pdf'
> org.xml.sax.SAXException: net.sf.saxon.trans.XPathException: Illegal HTML character: decimal 147
>  at net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:538) ~[Saxon-HE-9.9.0-2.jar:?]
>  at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:274) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:229) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:556) ~[tika-parsers-1.19.1.jar:1.19.1]
>  at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:267) ~[pdfbox-2.0.12.jar:2.0.12]
>  at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117) ~[tika-parsers-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172) ~[tika-parsers-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.1.jar:1.19.1]
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.19.1.jar:1.19.1]
>  at [removed section of trace]
> Caused by: net.sf.saxon.trans.XPathException: Illegal HTML character: decimal 147
>  at net.sf.saxon.serialize.HTMLEmitter.writeEscape(HTMLEmitter.java:379) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.serialize.XMLEmitter.characters(XMLEmitter.java:662) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.serialize.HTMLEmitter.characters(HTMLEmitter.java:441) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.serialize.HTMLIndenter.characters(HTMLIndenter.java:216) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ProxyReceiver.characters(ProxyReceiver.java:193) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.SequenceNormalizer.characters(SequenceNormalizer.java:183) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ReceivingContentHandler.flush(ReceivingContentHandler.java:646) ~[Saxon-HE-9.9.0-2.jar:?]
>  at net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:526) ~[Saxon-HE-9.9.0-2.jar:?]
>  ... 43 more
> {code}
> It looks like Tika is asking the XML library to handle character 147 (0x93), 
> which is not allowed in HTML.
> The Saxon XML library rejects it. I think the default Java one does not 
> complain when given the invalid character, but Tika is probably wrong to 
> write out that character when writing XHTML.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.

2019-10-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946386#comment-16946386
 ] 

ASF GitHub Bot commented on TIKA-2955:
--

LukeButters commented on pull request #285: Fix for TIKA-2955 filter out 
invalid HTML characters 0x7F to 0x9F
URL: https://github.com/apache/tika/pull/285




[jira] [Commented] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.

2019-10-07 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946362#comment-16946362
 ] 

Tim Allison commented on TIKA-2955:
---

If you make the PR against master, I’ll cherry-pick it to branch_1x. I’m happy 
to take the patch as is. Thank you for digging into the details!



[jira] [Updated] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.

2019-10-07 Thread Luke Butters (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Butters updated TIKA-2955:
---
Attachment: fix_with_tests.txt



[jira] [Comment Edited] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.

2019-10-07 Thread Luke Butters (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946270#comment-16946270
 ] 

Luke Butters edited comment on TIKA-2955 at 10/7/19 9:53 PM:
-

So [wikipedia Valid_characters_in_XML|https://en.wikipedia.org/wiki/Valid_characters_in_XML] 
says that for XML 1.0 this range is valid:
{quote}
U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters 
in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
{quote}
It goes on to say:
{quote}
The preceding code points ranges contain the following controls which are only 
valid in certain contexts in XML 1.0 documents, and whose usage is restricted 
and highly discouraged:
U+007F–U+0084, U+0086–U+009F: this includes a C0 control character and all 
but one C1 control.
{quote}
I think most of that range is allowed in XML, although discouraged.
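As a rough illustration of the XML 1.0 side, the Char production from the XML 1.0 specification can be written as a small Java check. The class and method names are made up for this sketch. It only shows that code point 0x93 falls inside an allowed range and says nothing about the "discouraged" status.

```java
final class Xml10CharCheck {
    /**
     * XML 1.0 Char production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x93));   // true: inside U+0020..U+D7FF
        System.out.println(isXml10Char(0xFFFE)); // false: forbidden non-character
        System.out.println(isXml10Char(0x0));    // false: NUL is never allowed
    }
}
```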

Going over to 
https://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#character-references 
it says:
{quote}
The numeric character reference forms described above are allowed to reference 
any Unicode code point other than U+0000, U+000D, permanently undefined Unicode 
characters (noncharacters), and control characters other than space characters.
{quote}
I think it is saying that control characters are excluded from those references.

Looking at: ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
{code}
007F;;Cc;0;BN;N;DELETE
0080;;Cc;0;BN;N;
0081;;Cc;0;BN;N;
0082;;Cc;0;BN;N;BREAK PERMITTED HERE
0083;;Cc;0;BN;N;NO BREAK HERE
0084;;Cc;0;BN;N;
0085;;Cc;0;B;N;NEXT LINE (NEL)
0086;;Cc;0;BN;N;START OF SELECTED AREA
0087;;Cc;0;BN;N;END OF SELECTED AREA
0088;;Cc;0;BN;N;CHARACTER TABULATION SET
0089;;Cc;0;BN;N;CHARACTER TABULATION WITH JUSTIFICATION
008A;;Cc;0;BN;N;LINE TABULATION SET
008B;;Cc;0;BN;N;PARTIAL LINE FORWARD
008C;;Cc;0;BN;N;PARTIAL LINE BACKWARD
008D;;Cc;0;BN;N;REVERSE LINE FEED
008E;;Cc;0;BN;N;SINGLE SHIFT TWO
008F;;Cc;0;BN;N;SINGLE SHIFT THREE
0090;;Cc;0;BN;N;DEVICE CONTROL STRING
0091;;Cc;0;BN;N;PRIVATE USE ONE
0092;;Cc;0;BN;N;PRIVATE USE TWO
0093;;Cc;0;BN;N;SET TRANSMIT STATE
0094;;Cc;0;BN;N;CANCEL CHARACTER
0095;;Cc;0;BN;N;MESSAGE WAITING
0096;;Cc;0;BN;N;START OF GUARDED AREA
0097;;Cc;0;BN;N;END OF GUARDED AREA
0098;;Cc;0;BN;N;START OF STRING
0099;;Cc;0;BN;N;
009A;;Cc;0;BN;N;SINGLE CHARACTER INTRODUCER
009B;;Cc;0;BN;N;CONTROL SEQUENCE INTRODUCER
009C;;Cc;0;BN;N;STRING TERMINATOR
009D;;Cc;0;BN;N;OPERATING SYSTEM COMMAND
009E;;Cc;0;BN;N;PRIVACY MESSAGE
009F;;Cc;0;BN;N;APPLICATION PROGRAM COMMAND
{code}
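The Cc general category shown in the listing can also be checked from Java, which may be handy when deciding what to filter. This is a minimal sketch using only `java.lang.Character`; the class and method names are invented for illustration.

```java
final class C1ControlCheck {
    /** True if every code point in [from, to] has Unicode general category Cc. */
    static boolean allControl(int from, int to) {
        for (int cp = from; cp <= to; cp++) {
            if (Character.getType(cp) != Character.CONTROL) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // DELETE plus the C1 controls, the same range as the listing.
        System.out.println(allControl(0x7F, 0x9F));
        // Decimal 147 is 0x93, the character reported in the Saxon error.
        System.out.println(Character.isISOControl(147));
    }
}
```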

I then remembered https://validator.w3.org/nu/#textarea exists and tried out 
{{}} there. The validator did not like that and said:
{code}
Character reference expands to a control character (U+007f).
{code}

So I think it is invalid only in HTML but OK in XML.

Should I make the pull request against version 2 or against the latest 1.x 
branch?
Here are the changes, though: [^fix_with_tests.txt] 
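The attachment itself is not reproduced in this thread. Purely as an illustration of the idea of filtering 0x7F to 0x9F, a SAX filter along these lines could drop those characters before they reach a strict serializer. `ControlCharFilter`, `isFiltered` and `strip` are hypothetical names, and dropping the characters (rather than replacing them) is an assumption, not necessarily what the patch does.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

final class ControlCharFilter extends XMLFilterImpl {
    /** Hypothetical helper: true for DELETE and the C1 controls (U+007F..U+009F). */
    static boolean isFiltered(char c) {
        return c >= 0x7F && c <= 0x9F;
    }

    /** Returns the input with U+007F..U+009F removed. */
    static String strip(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!isFiltered(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Forwards character data to the wrapped handler with the range removed. */
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        String cleaned = strip(new String(ch, start, length));
        if (!cleaned.isEmpty()) {
            super.characters(cleaned.toCharArray(), 0, cleaned.length());
        }
    }
}
```

Note that this range also covers U+0085 (NEL), which the Wikipedia quote above does not list as discouraged in XML, so a real fix may want to treat it separately.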



[jira] [Comment Edited] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.

2019-10-07 Thread Luke Butters (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946270#comment-16946270
 ] 

Luke Butters edited comment on TIKA-2955 at 10/7/19 9:41 PM:
-

So [wikipedia Valid_characters_in_XML|https://en.wikipedia.org/wiki/Valid_characters_in_XML] 
says that for XML 1.0 this range is valid:
{quote}
U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters 
in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
{quote}
It goes on to say:
{quote}
The preceding code points ranges contain the following controls which are only 
valid in certain contexts in XML 1.0 documents, and whose usage is restricted 
and highly discouraged:
U+007F–U+0084, U+0086–U+009F: this includes a C0 control character and all 
but one C1 control.
{quote}
I think most of that range is allowed in XML, although discouraged.

Going over to 
https://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#character-references 
it says:
{quote}
The numeric character reference forms described above are allowed to reference 
any Unicode code point other than U+0000, U+000D, permanently undefined Unicode 
characters (noncharacters), and control characters other than space characters.
{quote}
I think it is saying that control characters are excluded from those references.

Looking at: ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
{code}
007F;;Cc;0;BN;N;DELETE
0080;;Cc;0;BN;N;
0081;;Cc;0;BN;N;
0082;;Cc;0;BN;N;BREAK PERMITTED HERE
0083;;Cc;0;BN;N;NO BREAK HERE
0084;;Cc;0;BN;N;
0085;;Cc;0;B;N;NEXT LINE (NEL)
0086;;Cc;0;BN;N;START OF SELECTED AREA
0087;;Cc;0;BN;N;END OF SELECTED AREA
0088;;Cc;0;BN;N;CHARACTER TABULATION SET
0089;;Cc;0;BN;N;CHARACTER TABULATION WITH JUSTIFICATION
008A;;Cc;0;BN;N;LINE TABULATION SET
008B;;Cc;0;BN;N;PARTIAL LINE FORWARD
008C;;Cc;0;BN;N;PARTIAL LINE BACKWARD
008D;;Cc;0;BN;N;REVERSE LINE FEED
008E;;Cc;0;BN;N;SINGLE SHIFT TWO
008F;;Cc;0;BN;N;SINGLE SHIFT THREE
0090;;Cc;0;BN;N;DEVICE CONTROL STRING
0091;;Cc;0;BN;N;PRIVATE USE ONE
0092;;Cc;0;BN;N;PRIVATE USE TWO
0093;;Cc;0;BN;N;SET TRANSMIT STATE
0094;;Cc;0;BN;N;CANCEL CHARACTER
0095;;Cc;0;BN;N;MESSAGE WAITING
0096;;Cc;0;BN;N;START OF GUARDED AREA
0097;;Cc;0;BN;N;END OF GUARDED AREA
0098;;Cc;0;BN;N;START OF STRING
0099;;Cc;0;BN;N;
009A;;Cc;0;BN;N;SINGLE CHARACTER INTRODUCER
009B;;Cc;0;BN;N;CONTROL SEQUENCE INTRODUCER
009C;;Cc;0;BN;N;STRING TERMINATOR
009D;;Cc;0;BN;N;OPERATING SYSTEM COMMAND
009E;;Cc;0;BN;N;PRIVACY MESSAGE
009F;;Cc;0;BN;N;APPLICATION PROGRAM COMMAND
{code}

I then remembered https://validator.w3.org/nu/#textarea exists and tried out 
{{}} there. The validator did not like that and said:
{code}
Character reference expands to a control character (U+007f).
{code}

So I think it is invalid only in HTML but OK in XML.

Should I make the pull request against version 2 or against the latest 1.x 
branch?



[jira] [Commented] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.

2019-10-07 Thread Luke Butters (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946270#comment-16946270
 ] 

Luke Butters commented on TIKA-2955:


So [wikipedia Valid_characters_in_XML|https://en.wikipedia.org/wiki/Valid_characters_in_XML] 
says that for XML 1.0 this range is valid:
{quote}
U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters 
in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
{quote}
It goes on to say:
{quote}
The preceding code points ranges contain the following controls which are only 
valid in certain contexts in XML 1.0 documents, and whose usage is restricted 
and highly discouraged:
U+007F–U+0084, U+0086–U+009F: this includes a C0 control character and all 
but one C1 control.
{quote}
I think most of that range is allowed in XML, although discouraged.

Going over to 
https://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#character-references 
it says:
{quote}
The numeric character reference forms described above are allowed to reference 
any Unicode code point other than U+0000, U+000D, permanently undefined Unicode 
characters (noncharacters), and control characters other than space characters.
{quote}
I think it is saying that control characters are excluded from those references.

Looking at: ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
{code}
007F;;Cc;0;BN;N;DELETE
0080;;Cc;0;BN;N;
0081;;Cc;0;BN;N;
0082;;Cc;0;BN;N;BREAK PERMITTED HERE
0083;;Cc;0;BN;N;NO BREAK HERE
0084;;Cc;0;BN;N;
0085;;Cc;0;B;N;NEXT LINE (NEL)
0086;;Cc;0;BN;N;START OF SELECTED AREA
0087;;Cc;0;BN;N;END OF SELECTED AREA
0088;;Cc;0;BN;N;CHARACTER TABULATION SET
0089;;Cc;0;BN;N;CHARACTER TABULATION WITH JUSTIFICATION
008A;;Cc;0;BN;N;LINE TABULATION SET
008B;;Cc;0;BN;N;PARTIAL LINE FORWARD
008C;;Cc;0;BN;N;PARTIAL LINE BACKWARD
008D;;Cc;0;BN;N;REVERSE LINE FEED
008E;;Cc;0;BN;N;SINGLE SHIFT TWO
008F;;Cc;0;BN;N;SINGLE SHIFT THREE
0090;;Cc;0;BN;N;DEVICE CONTROL STRING
0091;;Cc;0;BN;N;PRIVATE USE ONE
0092;;Cc;0;BN;N;PRIVATE USE TWO
0093;;Cc;0;BN;N;SET TRANSMIT STATE
0094;;Cc;0;BN;N;CANCEL CHARACTER
0095;;Cc;0;BN;N;MESSAGE WAITING
0096;;Cc;0;BN;N;START OF GUARDED AREA
0097;;Cc;0;BN;N;END OF GUARDED AREA
0098;;Cc;0;BN;N;START OF STRING
0099;;Cc;0;BN;N;
009A;;Cc;0;BN;N;SINGLE CHARACTER INTRODUCER
009B;;Cc;0;BN;N;CONTROL SEQUENCE INTRODUCER
009C;;Cc;0;BN;N;STRING TERMINATOR
009D;;Cc;0;BN;N;OPERATING SYSTEM COMMAND
009E;;Cc;0;BN;N;PRIVACY MESSAGE
009F;;Cc;0;BN;N;APPLICATION PROGRAM COMMAND
{code}

I then remembered https://validator.w3.org/nu/#textarea exists and tried out 
{{}} there. The validator did not like that and said:
{code}
Character reference expands to a control character (U+007f).
{code}

So I think it is invalid only in HTML but ok in XML.


[jira] [Comment Edited] (TIKA-2955) PDF parsing to XHTML results in tika attempting to write invalid HTML characters.

2019-10-07 Thread Luke Butters (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946270#comment-16946270
 ] 

Luke Butters edited comment on TIKA-2955 at 10/7/19 9:40 PM:
-

So [wikipedia 
Valid_characters_in_XML|https://en.wikipedia.org/wiki/Valid_characters_in_XML] 
says that, for XML 1.0, this range is valid:
{quote}
U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters 
in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
{quote}
 it goes on to say:
{quote}
The preceding code points ranges contain the following controls which are only 
valid in certain contexts in XML 1.0 documents, and whose usage is restricted 
and highly discouraged:
U+007F–U+0084, U+0086–U+009F: this includes a C0 control character and all 
but one C1 control.
{quote}
I think most of that range is allowed in XML, although discouraged.
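
To make the XML side concrete, here is a minimal Java sketch of the XML 1.0 {{Char}} production together with the "discouraged" control range quoted above. The class and method names are made up for illustration; this is not Tika or Saxon code, only a check of the ranges as I read them.

```java
/** Illustrative only: range checks for XML 1.0 characters (hypothetical class name). */
final class Xml10Chars {
    private Xml10Chars() {}

    /**
     * True if the code point matches the XML 1.0 Char production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True for the discouraged controls quoted above: U+007F-U+0084 and U+0086-U+009F. */
    static boolean isDiscouragedControl(int cp) {
        return (cp >= 0x7F && cp <= 0x84) || (cp >= 0x86 && cp <= 0x9F);
    }

    public static void main(String[] args) {
        int cp = 147; // U+0093, the character named in the Saxon error
        System.out.println(isXml10Char(cp));          // prints "true"
        System.out.println(isDiscouragedControl(cp)); // prints "true"
    }
}
```

So decimal 147 falls inside the allowed XML 1.0 range while also being one of the discouraged controls.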

Going over to 
https://www.w3.org/TR/2011/WD-html5-20110525/syntax.html#character-references 
it says:
{quote}
The numeric character reference forms described above are allowed to reference 
any Unicode code point other than U+0000, U+000D, permanently undefined Unicode 
characters (noncharacters), and control characters other than space characters.
{quote}
I think it is trying to say that control characters are excluded from numeric 
character references.
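
As a rough sketch of how I read that HTML5 rule, the check below rejects U+0000, U+000D, noncharacters, and control characters other than space characters. Two things are my own assumptions rather than part of the quote: the list of HTML "space characters" (U+0020, U+0009, U+000A, U+000C, U+000D, as I understand the HTML5 definition) and the use of {{Character.isISOControl}} to stand in for "control character". The class name is hypothetical.

```java
/** Illustrative only: which code points an HTML5 numeric character reference may name. */
final class HtmlCharRefs {
    private HtmlCharRefs() {}

    /** HTML5 "space characters" (assumed list: space, tab, LF, FF, CR). */
    static boolean isHtmlSpace(int cp) {
        return cp == 0x20 || cp == 0x09 || cp == 0x0A || cp == 0x0C || cp == 0x0D;
    }

    /** Unicode noncharacters: U+FDD0-U+FDEF and every code point ending in FFFE or FFFF. */
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Applies the quoted rule; returns false for anything the rule excludes. */
    static boolean isAllowedNumericReference(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            return false;
        }
        if (cp == 0x0000 || cp == 0x000D) {
            return false; // excluded explicitly by the quoted text
        }
        if (isNoncharacter(cp)) {
            return false;
        }
        // Controls are rejected unless they are also space characters (tab, LF, FF).
        return !(Character.isISOControl(cp) && !isHtmlSpace(cp));
    }

    public static void main(String[] args) {
        System.out.println(isAllowedNumericReference(147));  // prints "false"
        System.out.println(isAllowedNumericReference(0x7F)); // prints "false"
        System.out.println(isAllowedNumericReference(0x0A)); // prints "true"
    }
}
```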

Looking at: ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
{code}
007F;;Cc;0;BN;N;DELETE
0080;;Cc;0;BN;N;
0081;;Cc;0;BN;N;
0082;;Cc;0;BN;N;BREAK PERMITTED HERE
0083;;Cc;0;BN;N;NO BREAK HERE
0084;;Cc;0;BN;N;
0085;;Cc;0;B;N;NEXT LINE (NEL)
0086;;Cc;0;BN;N;START OF SELECTED AREA
0087;;Cc;0;BN;N;END OF SELECTED AREA
0088;;Cc;0;BN;N;CHARACTER TABULATION SET
0089;;Cc;0;BN;N;CHARACTER TABULATION WITH JUSTIFICATION
008A;;Cc;0;BN;N;LINE TABULATION SET
008B;;Cc;0;BN;N;PARTIAL LINE FORWARD
008C;;Cc;0;BN;N;PARTIAL LINE BACKWARD
008D;;Cc;0;BN;N;REVERSE LINE FEED
008E;;Cc;0;BN;N;SINGLE SHIFT TWO
008F;;Cc;0;BN;N;SINGLE SHIFT THREE
0090;;Cc;0;BN;N;DEVICE CONTROL STRING
0091;;Cc;0;BN;N;PRIVATE USE ONE
0092;;Cc;0;BN;N;PRIVATE USE TWO
0093;;Cc;0;BN;N;SET TRANSMIT STATE
0094;;Cc;0;BN;N;CANCEL CHARACTER
0095;;Cc;0;BN;N;MESSAGE WAITING
0096;;Cc;0;BN;N;START OF GUARDED AREA
0097;;Cc;0;BN;N;END OF GUARDED AREA
0098;;Cc;0;BN;N;START OF STRING
0099;;Cc;0;BN;N;
009A;;Cc;0;BN;N;SINGLE CHARACTER INTRODUCER
009B;;Cc;0;BN;N;CONTROL SEQUENCE INTRODUCER
009C;;Cc;0;BN;N;STRING TERMINATOR
009D;;Cc;0;BN;N;OPERATING SYSTEM COMMAND
009E;;Cc;0;BN;N;PRIVACY MESSAGE
009F;;Cc;0;BN;N;APPLICATION PROGRAM COMMAND
{code}

I then remembered that https://validator.w3.org/nu/#textarea exists and tried out 
{{}}. The validator did not like that and said:
{code}
Character reference expands to a control character (U+007f).
{code}

So I think it is invalid only in HTML but OK in XML.
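
One possible way to avoid handing such characters to an HTML serializer is to scrub them from the text first. The sketch below is only an illustration of that idea and is not necessarily what the fix for this issue does: it replaces every code point in U+007F–U+009F with a caller-chosen replacement. The class name, method names and the choice of range are assumptions of mine.

```java
/** Illustrative only: replaces DEL and the C1 controls before text is written as HTML. */
final class ControlCharScrubber {
    private ControlCharScrubber() {}

    /** True for U+007F through U+009F, the range listed from UnicodeData.txt above. */
    static boolean isProblemControl(int cp) {
        return cp >= 0x7F && cp <= 0x9F;
    }

    /** Returns a copy of text with each problem control replaced by the given code point. */
    static String scrub(String text, int replacement) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .forEach(cp -> out.appendCodePoint(isProblemControl(cp) ? replacement : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0093 is decimal 147, the character from the Saxon error.
        System.out.println(scrub("a\u0093b", ' ')); // prints "a b"
    }
}
```

Whether a space, U+FFFD, or plain removal is the right replacement depends on the consumer, so the sketch leaves that choice to the caller.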



[jira] [Commented] (TIKA-2941) OSGI bundle and app are not self-contained

2019-10-07 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945889#comment-16945889
 ] 

Bob Paulin commented on TIKA-2941:
--

Just an update to provide some transparency around why we got here.  With 
the newer version of the maven-bundle-plugin, when I revert my earlier commit 
I do not see the transitive dependencies included if tika-parsers is in 
provided scope.  With tika-parsers being embedded it does not really make 
sense for it to be in provided scope anyway.  However, with tika-parsers as a 
compile-time dependency, all the transitive dependencies are included in 
Maven, which is what is being called out as the issue in this JIRA.  The good 
thing is that from an OSGi perspective we're still OK, since only the 
following packages are exported:

 
{code:java}

  !org.apache.tika.parser,
  !org.apache.tika.parser.external,
  org.apache.tika.parser.*,
  org.apache.tika.metadata.serialization.*,
 {code}
 

But the Maven side still shows all the transitive dependencies coming through.  
So in an OSGi runtime all these packages are private as expected, but in the 
development environment this is a bit confusing since Maven shows them coming 
through.  I will need some time to see if we can get the Maven side of this 
equation right without breaking the OSGi side.  Hopefully this helps provide 
some context around the problem we're solving.
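
For anyone less familiar with export instructions like the ones quoted above, here is a small Java sketch of how such a list could be evaluated. The matching rules are assumptions of this sketch, modelled on my understanding of bnd-style instructions: entries are tried in order, the first match wins, a leading {{!}} excludes, and a trailing {{.*}} matches the package itself plus its sub-packages. It is not the plugin's actual implementation, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: first-match evaluation of ordered export instructions. */
final class ExportPackageSketch {
    private ExportPackageSketch() {}

    /** The four instructions quoted in the comment above. */
    static final List<String> EXPORTS = Arrays.asList(
            "!org.apache.tika.parser",
            "!org.apache.tika.parser.external",
            "org.apache.tika.parser.*",
            "org.apache.tika.metadata.serialization.*");

    /** Assumed rule: "base.*" matches base and anything under it; otherwise exact match. */
    static boolean matches(String pattern, String pkg) {
        if (pattern.endsWith(".*")) {
            String base = pattern.substring(0, pattern.length() - 2);
            return pkg.equals(base) || pkg.startsWith(base + ".");
        }
        return pattern.equals(pkg);
    }

    /** Walks the instructions in order; the first matching entry decides. */
    static boolean isExported(String pkg, List<String> instructions) {
        for (String raw : instructions) {
            boolean negated = raw.startsWith("!");
            String pattern = negated ? raw.substring(1) : raw;
            if (matches(pattern, pkg)) {
                return !negated;
            }
        }
        return false; // nothing matched, so the package stays private
    }

    public static void main(String[] args) {
        System.out.println(isExported("org.apache.tika.parser.pdf", EXPORTS)); // prints "true"
        System.out.println(isExported("org.apache.tika.parser", EXPORTS));     // prints "false"
        System.out.println(isExported("org.apache.pdfbox", EXPORTS));          // prints "false"
    }
}
```

Under these assumptions the embedded third-party packages never match an export entry, which is consistent with them being private in the OSGi runtime even though Maven still lists them.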

 

> OSGI bundle and app are not self-contained
> --
>
> Key: TIKA-2941
> URL: https://issues.apache.org/jira/browse/TIKA-2941
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.22
>Reporter: Peng Cheng
>Priority: Major
>
> The Tika bundle still has dependencies that spill out of its package and 
> cause jar hell everywhere. If the Tika bundle is declared in Maven as a 
> dependency, a Maven dependency:tree will show:
> [INFO] | +- org.apache.tika:tika-bundle:jar:1.22:test
>  [INFO] | | +- org.apache.tika:tika-core:jar:1.22:test
>  [INFO] | | - org.apache.tika:tika-parsers:jar:1.22:test
>  [INFO] | | +- org.glassfish.jaxb:jaxb-runtime:jar:2.3.2:test
>  [INFO] | | | +- jakarta.xml.bind:jakarta.xml.bind-api:jar:2.3.2:test
>  [INFO] | | | +- org.glassfish.jaxb:txw2:jar:2.3.2:test
>  [INFO] | | | +- com.sun.istack:istack-commons-runtime:jar:3.0.8:test
>  [INFO] | | | +- org.jvnet.staxex:stax-ex:jar:1.8.1:test
>  [INFO] | | | - com.sun.xml.fastinfoset:FastInfoset:jar:1.2.16:test
>  [INFO] | | +- com.sun.activation:jakarta.activation:jar:1.2.1:test
>  [INFO] | | +- org.gagravarr:vorbis-java-tika:jar:0.8:test
>  [INFO] | | +- org.tallison:jmatio:jar:1.5:test
>  [INFO] | | +- org.apache.james:apache-mime4j-core:jar:0.8.3:test
>  [INFO] | | +- org.apache.james:apache-mime4j-dom:jar:0.8.3:test
>  [INFO] | | +- com.epam:parso:jar:2.0.11:test
>  [INFO] | | +- org.brotli:dec:jar:0.1.2:test
>  [INFO] | | +- org.apache.pdfbox:pdfbox:jar:2.0.16:test
>  [INFO] | | | - org.apache.pdfbox:fontbox:jar:2.0.16:test
>  [INFO] | | +- org.apache.pdfbox:pdfbox-tools:jar:2.0.16:test
>  [INFO] | | +- org.apache.pdfbox:jempbox:jar:1.8.16:test
>  [INFO] | | +- org.bouncycastle:bcmail-jdk15on:jar:1.62:test
>  [INFO] | | | - org.bouncycastle:bcpkix-jdk15on:jar:1.62:test
>  [INFO] | | +- org.bouncycastle:bcprov-jdk15on:jar:1.62:test
>  [INFO] | | +- org.apache.poi:poi:jar:4.0.1:test
>  [INFO] | | | - org.apache.commons:commons-collections4:jar:4.2:test
>  [INFO] | | +- org.apache.poi:poi-scratchpad:jar:4.0.1:test
>  [INFO] | | +- org.apache.poi:poi-ooxml:jar:4.0.1:test
>  [INFO] | | | +- org.apache.poi:poi-ooxml-schemas:jar:4.0.1:test
>  [INFO] | | | | - org.apache.xmlbeans:xmlbeans:jar:3.0.2:test
>  [INFO] | | | - com.github.virtuald:curvesapi:jar:1.05:test
>  [INFO] | | +- com.healthmarketscience.jackcess:jackcess:jar:3.0.1:test
>  [INFO] | | +- 
> com.healthmarketscience.jackcess:jackcess-encrypt:jar:3.0.0:test
>  [INFO] | | +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:test
>  [INFO] | | +- org.ow2.asm:asm:jar:7.2-beta:test
>  [INFO] | | +- com.googlecode.mp4parser:isoparser:jar:1.1.22:test
>  [INFO] | | +- com.drewnoakes:metadata-extractor:jar:2.11.0:test
>  [INFO] | | | - com.adobe.xmp:xmpcore:jar:5.1.3:test
>  [INFO] | | +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:test
>  [INFO] | | +- com.rometools:rome:jar:1.12.1:test
>  [INFO] | | | - com.rometools:rome-utils:jar:1.12.1:test
>  [INFO] | | +- org.gagravarr:vorbis-java-core:jar:0.8:test
>  [INFO] | | +- org.codelibs:jhighlight:jar:1.0.3:test
>  [INFO] | | +- com.pff:java-libpst:jar:0.8.1:test
>  [INFO] | | +- com.github.junrar:junrar:jar:4.0.0:test
>  [INFO] | | +- org.apache.cxf:cxf-rt-rs-client:jar:3.3.2:test
>  [INFO] | | | +- org.apache.cxf:cxf-rt-transports-http:jar:3.3.2:test
>  [INFO] | | | +- org.apache.cxf:cxf-core:jar:3.3.2:test
>  [INFO] | | | | +-