I observed that the link to the PDF document is placed in an iframe on that
page. Is there any way for Nutch to crawl that embedded link?
<!--/opencms/resources/UploadCDSCOWeb/2022/acts_and_rules/Drugs and Cosmetics
Act, 1940.pdf -->
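<iframe src='/opencms/resources/UploadCDSCOWeb/2022/acts_and_rules/Drugs and
Cosmetics Act, 1940.pdf' width='100%' height='100%' target='_blank'>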
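
One way to get at the embedded link outside of Nutch is to read the iframe's
src attribute from the response body and resolve it against the page URL. The
following is only a minimal Python sketch, not Nutch code: the names
IframeSrcCollector and iframe_targets are made up for illustration, the body is
the one curl returns in the quoted reply, and the single space between the
words of the file name is an assumption (the mail wraps the line there).

```python
from html.parser import HTMLParser
from urllib.parse import quote, urljoin

# Page URL and response body as quoted in this thread.
PAGE_URL = (
    "https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/"
    "elements/download_file_division.jsp?num_id=OTIyNw=="
)
BODY = (
    "<!--/opencms/resources/UploadCDSCOWeb/2022/acts_and_rules/"
    "Drugs and Cosmetics Act, 1940.pdf -->\n"
    "<iframe src='/opencms/resources/UploadCDSCOWeb/2022/acts_and_rules/"
    "Drugs and Cosmetics Act, 1940.pdf' width='100%' height='100%' "
    "target='_blank'>"
)


class IframeSrcCollector(HTMLParser):
    """Hypothetical helper: records the src of every <iframe> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_targets(page_url, html):
    """Return absolute, percent-encoded URLs for all iframe sources."""
    collector = IframeSrcCollector()
    collector.feed(html)
    collector.close()
    # Resolve each src against the page URL, then encode the spaces.
    return [
        quote(urljoin(page_url, src), safe=":/?=&%,")
        for src in collector.sources
    ]


if __name__ == "__main__":
    for url in iframe_targets(PAGE_URL, BODY):
        print(url)
```

The printed URL could then be added to a seed list by hand; whether Nutch
follows the iframe by itself depends on which parser handles the page.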
Thanks and Regards
Raj Chidara
---- On Thu, 24 Oct 2024 02:12:38 +0530 Hiran Chaudhuri wrote ---
When I run the request through curl like so:
curl -v 'https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw=='
I get this response body:
<!--/opencms/resources/UploadCDSCOWeb/2022/acts_and_rules/Drugs and
Cosmetics Act, 1940.pdf -->
<iframe
src='/opencms/resources/UploadCDSCOWeb/2022/acts_and_rules/Drugs and
Cosmetics Act, 1940.pdf' width='100%' height='100%' target='_blank'>
That is indeed not well-formed XML, so the Tika XML parser is right to reject it.
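
To illustrate the point with something other than Tika: the sketch below feeds
the same body to Python's strict XML parser and to its lenient HTML parser. It
is only an analogy for the XML-versus-HTML behaviour, not what Nutch or Tika
run internally; is_well_formed_xml, TagRecorder and html_start_tags are
hypothetical names, and the single space inside the file name is assumed.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Response body as shown by curl above (the <iframe> is never closed).
BODY = (
    "<!--/opencms/resources/UploadCDSCOWeb/2022/acts_and_rules/"
    "Drugs and Cosmetics Act, 1940.pdf -->\n"
    "<iframe src='/opencms/resources/UploadCDSCOWeb/2022/acts_and_rules/"
    "Drugs and Cosmetics Act, 1940.pdf' width='100%' height='100%' "
    "target='_blank'>"
)


def is_well_formed_xml(text):
    """Return True only if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


class TagRecorder(HTMLParser):
    """Hypothetical helper: remembers every start tag the HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def html_start_tags(text):
    """Parse the text as HTML and return the start tags that were found."""
    recorder = TagRecorder()
    recorder.feed(text)
    recorder.close()
    return recorder.start_tags


if __name__ == "__main__":
    print("well-formed XML:", is_well_formed_xml(BODY))
    print("HTML start tags:", html_start_tags(BODY))
```

The missing closing tag is what makes the strict parser give up, while an HTML
parser still sees the iframe element and its attributes.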
But the content type header indicates something else:
Content-Type: text/html;charset=UTF-8
Why is the XML parser invoked at all?
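
For reference, this is how the declared media type can be read from that
header. It is a small Python sketch using the standard library;
declared_media_type and looks_like_xml are hypothetical helper names, and the
list of XML media types is an assumption for illustration, not the rule Tika or
Nutch apply.

```python
from email.message import EmailMessage


def declared_media_type(content_type_header):
    """Split a Content-Type header value into (media type, charset)."""
    msg = EmailMessage()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type(), msg.get_content_charset()


def looks_like_xml(media_type):
    """Assumed heuristic: typical XML media types, or an '+xml' suffix."""
    return media_type in ("text/xml", "application/xml") or media_type.endswith("+xml")


if __name__ == "__main__":
    media_type, charset = declared_media_type("text/html;charset=UTF-8")
    print(media_type, charset, looks_like_xml(media_type))
```

Judged by the header alone the document is HTML, so the choice of the XML
parser has to come from somewhere else, for example content detection or the
parser configuration.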
On 23.10.24 14:27, Raj Chidara wrote:
> Hi
>
> I get the following exception when Nutch parses this page:
> https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw==
>
> I commented out this rule in regex-urlfilter.txt so that this particular
> page would be crawled:
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
>
> 2024-10-23 17:57:17,822 ERROR o.a.n.p.t.TikaParser [parse-0] Error parsing https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw==
> org.apache.tika.exception.TikaException: XML parse error
>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:77) ~[?:?]
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289) ~[tika-core-2.3.0.jar:2.3.0]
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) ~[?:?]
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) ~[?:?]
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34) ~[apache-nutch-1.19.jar:?]
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23) ~[apache-nutch-1.19.jar:?]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
>     at java.lang.Thread.run(Thread.java:829) ~[?:?]
> Caused by: org.xml.sax.SAXParseException: XML document structures must start and end within the same entity.
>     at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
>     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
>     at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.endEntity(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
>     at org.apache.xerces.impl.XMLDocumentScannerImpl.endEntity(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
>     at org.apache.xerces.impl.XMLEntityManager.endEntity(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
>     at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
>     at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source) ~[xercesImpl-2.12.2.jar:2.12.2]
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source) ~[xercesImpl-2.12.2.jar:?]
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:197) ~[?:?]
>     at org.apache.tika.utils.XMLReaderUtils.parseSAX(XMLReaderUtils.java:479) ~[tika-core-2.3.0.jar:2.3.0]
>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:71) ~[?:?]
>
> Thanks and Regards
>
> Raj Chidara
>
> Global Locations:
>
> USA | UK | India | Singapore | Japan
>
> *ISO 9001, 27001, 20000 Compliant
>
> www.DDIsmart.com
>
> About Us | Awards | Blog | News | Contact Us
>
> DISCLAIMER: This message is intended solely for the use of the individual or
> entity to which it is addressed. If you are not the intended recipient, you
> should not use, copy, alter, or disclose the contents of this message. All
> information or opinions expressed in this message and/or any attachments are
> those of the author and are not necessarily those of the group companies.
>