[ https://issues.apache.org/jira/browse/TIKA-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662291#action_12662291 ]
Andrzej Rusin commented on TIKA-185:
------------------------------------
I agree with Jukka that we should prevent access to external entities.
Reasons:
- we can by no means guarantee that external entities exist for "flat"
files,
- the situation gets even worse with XML files inside zip archives,
- and there are unusual protocols such as chrome:// that java.net.URL does
not recognize.
I tried to fix the problem by adding an EntityResolver, but it did not work
for me. Maybe my DummyEntityResolver was not smart enough ;)
SAXParser parser = factory.newSAXParser();
parser.getXMLReader().setEntityResolver(new DummyEntityResolver());
parser.parse(....
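
One possible explanation, offered as an assumption rather than a diagnosis: SAXParser.parse(source, handler) registers the handler itself as the entity resolver, which would replace a resolver set earlier on the XMLReader. A minimal sketch that sidesteps this by overriding resolveEntity in the handler and returning an empty source. The names OfflineResolverSketch, OfflineHandler and extractText are hypothetical and are not Tika code:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class OfflineResolverSketch {

    /** Hypothetical handler: collects text and resolves every external entity to empty content. */
    static class OfflineHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            // Returning null would let the parser open systemId itself,
            // so hand back an empty character stream instead.
            return new InputSource(new StringReader(""));
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses the given XML string and returns its character content. */
    static String extractText(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        OfflineHandler handler = new OfflineHandler();
        // The handler passed here also acts as the entity resolver.
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }
}
```

With a handler like this, an external entity that points at chrome:// should expand to nothing, and the parser has no reason to open the URL.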
Also, I experimented with SAXParserFactory features like this:
factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
The exception changed, but the entities still did not resolve.
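
For reference, the same two feature settings in a self-contained form. This is only a sketch: the names ExternalEntityFeaturesSketch and extractText are hypothetical, and the behaviour depends on the parser implementation. A document that references entities declared only in a skipped external DTD may still fail to parse.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ExternalEntityFeaturesSketch {

    // Standard SAX2 feature names, the same ones used in the comment above.
    static final String GENERAL =
            "http://xml.org/sax/features/external-general-entities";
    static final String PARAMETER =
            "http://xml.org/sax/features/external-parameter-entities";

    /** Parses the given XML string with external entities disabled. */
    static String extractText(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Ask the parser not to include external general or parameter entities.
        factory.setFeature(GENERAL, false);
        factory.setFeature(PARAMETER, false);
        SAXParser parser = factory.newSAXParser();

        final StringBuilder text = new StringBuilder();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }
}
```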
> XML files with (unsatisfied) SYSTEM entities can not be indexed
> ---------------------------------------------------------------
>
> Key: TIKA-185
> URL: https://issues.apache.org/jira/browse/TIKA-185
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.2
> Reporter: Andrzej Rusin
> Priority: Minor
> Attachments: xmlTest.xml, xmlTest2.xml
>
>
> When trying to extract an XPI file (a Firefox extension, which is probably not
> the best candidate for extraction) I got the exception below.
> It was caused by SYSTEM entities referring to the chrome:// protocol.
> More generally, any XML file that contains SYSTEM entities which cannot
> be accessed at the time of extraction will not be extracted properly.
> Here is the stack trace:
> java.net.MalformedURLException: unknown protocol: chrome
> at java.net.URL.<init>(URL.java:574)
> at java.net.URL.<init>(URL.java:464)
> at java.net.URL.<init>(URL.java:413)
> at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
> at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
> at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
> at org.apache.xerces.impl.XMLDTDScannerImpl.startPE(Unknown Source)
> at org.apache.xerces.impl.XMLDTDScannerImpl.skipSeparator(Unknown Source)
> at org.apache.xerces.impl.XMLDTDScannerImpl.scanDecls(Unknown Source)
> at org.apache.xerces.impl.XMLDTDScannerImpl.scanDTDInternalSubset(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
> at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
> at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:57)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
> at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:93)
> at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:56)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.