The following message is logged before the actual error given below:
o.a.n.p.ParserFactory [LocalJobRunner Map Task Executor #0] The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file

Thanks and Regards
Raj Chidara

---- On Wed, 23 Oct 2024 17:57:59 +0530 Raj Chidara <[email protected]> wrote ---

Hi

I get the following exception when Nutch is parsing this page:
https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw==

I commented out this configuration in regex-urlfilter.txt so that this particular page will be crawled:

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

2024-10-23 17:57:17,822 ERROR o.a.n.p.t.TikaParser [parse-0] Error parsing https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw==
org.apache.tika.exception.TikaException: XML parse error
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:77) ~[?:?]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-core-2.3.0.jar:2.3.0]
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) ~[?:?]
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) ~[?:?]
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) ~[apache-nutch-1.19.jar:?]
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) ~[apache-nutch-1.19.jar:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.xml.sax.SAXParseException: XML document structures must start and end within the same entity.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.endEntity(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLDocumentScannerImpl.endEntity(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLEntityManager.endEntity(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:197) ~[?:?]
    at org.apache.tika.utils.XMLReaderUtils.parseSAX(XMLReaderUtils.java:479) ~[tika-core-2.3.0.jar:2.3.0]
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:71) ~[?:?]

Thanks and Regards
Raj Chidara

Global Locations: USA | UK | India | Singapore | Japan
*ISO 9001, 27001, 20000 Compliant
www.DDIsmart.com
About Us | Awards | Blog | News | Contact Us

DISCLAIMER: This message is intended solely for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you should not use, copy, alter, or disclose the contents of this message. All information or opinions expressed in this message and/or any attachments are those of the author and are not necessarily those of the group companies.

