The following message is logged before the actual error:


o.a.n.p.ParserFactory [LocalJobRunner Map Task Executor #0] The parsing 
plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the 
plugin.includes system property, and all claim to support the content type 
application/xml, but they are not mapped to it  in the parse-plugins.xml file
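
For context, the mapping this warning refers to lives in conf/parse-plugins.xml. A minimal sketch of such an entry is below. It assumes the stock plugin id parse-tika, and that the aliases section of the same file already maps parse-tika to org.apache.nutch.parse.tika.TikaParser; neither has been checked against this installation.

```xml
<!-- Sketch only: maps the application/xml content type to the Tika parser plugin.
     Assumes the plugin id "parse-tika" is aliased to
     org.apache.nutch.parse.tika.TikaParser in the <aliases> section of this file. -->
<mimeType name="application/xml">
  <plugin id="parse-tika" />
</mimeType>
```

An entry like this would only address the mapping warning. Whether it has any bearing on the XML parse error itself is a separate question.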



Thanks and Regards

Raj Chidara








---- On Wed, 23 Oct 2024 17:57:59 +0530 Raj Chidara <[email protected]> 
wrote ---



Hi 
 
I get the following exception when Nutch parses this page:
https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw==
 
 
 
 
I commented out this rule in regex-urlfilter.txt so that this particular
page is crawled:
 
 
 
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
 
 
 
2024-10-23 17:57:17,822 ERROR o.a.n.p.t.TikaParser [parse-0] Error parsing https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw==
org.apache.tika.exception.TikaException: XML parse error
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:77) ~[?:?]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-core-2.3.0.jar:2.3.0]
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) ~[?:?]
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) ~[?:?]
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) ~[apache-nutch-1.19.jar:?]
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) ~[apache-nutch-1.19.jar:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.xml.sax.SAXParseException: XML document structures must start and end within the same entity.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.endEntity(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLDocumentScannerImpl.endEntity(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLEntityManager.endEntity(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:197) ~[?:?]
    at org.apache.tika.utils.XMLReaderUtils.parseSAX(XMLReaderUtils.java:479) ~[tika-core-2.3.0.jar:2.3.0]
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:71) ~[?:?]
 
 
 
Thanks and Regards 
 
Raj Chidara 
 
 
 
 
Global Locations: 
 
USA | UK | India | Singapore | Japan 
 
*ISO 9001, 27001, 20000 Compliant 
 
www.DDIsmart.com 
 
About Us | Awards | Blog | News | Contact Us 
 
 
 
 
 
 
 
 
 
DISCLAIMER: This message is intended solely for the use of the individual or 
entity to which it is addressed. If you are not the intended recipient, you 
should not use, copy, alter, or disclose the contents of this message. All 
information or opinions expressed in this message and/or any attachments are 
those of the author and are not necessarily those of the group companies.

 
 
 