all I did was add "application/xml" in file plugins/parse-html/plugin.xml

On Fri, Aug 2, 2013 at 3:22 PM, A Laxmi <[email protected]> wrote:

> I could solve my issue. I am not sure whether this was fixed in 1.7, but
> with Nutch 1.6 all I did was add "application/xml" to the contentType
> parameter in plugins/parse-html/plugin.xml:
>
>   <parameter name="contentType"
>              value="text/html|application/xhtml+xml|application/xml"/>
>
> That fixed my issue. Hopefully it helps someone with the same problem.
>
>
> On Fri, Aug 2, 2013 at 10:48 AM, A Laxmi <[email protected]> wrote:
>
>> With Nutch 1.6, I could not crawl one particular site; it gives me the
>> following error message in the parsing stage. I tried googling the issue,
>> and I tried changing parse.timeout to 3600 and even to -1, but neither
>> made any difference. Please help.
>>
>> Error message: Error parsing http://www.####.com/ failed(2,0): XML parse error
>>
>> From the logs:
>>
>> 2013-08-02 10:12:03,446 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2013-08-02 10:12:03,465 INFO http.Http - http.proxy.host = null
>> 2013-08-02 10:12:03,466 INFO http.Http - http.proxy.port = 8080
>> 2013-08-02 10:12:03,466 INFO http.Http - http.timeout = 240000
>> 2013-08-02 10:12:03,466 INFO http.Http - http.content.limit = -1
>> 2013-08-02 10:12:03,466 INFO http.Http - http.agent = Nutch Spider/Nutch-1.6
>> 2013-08-02 10:12:03,466 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
>> 2013-08-02 10:12:03,466 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>> 2013-08-02 10:12:03,472 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2013-08-02 10:12:03,473 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2013-08-02 10:12:03,476 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2013-08-02 10:12:03,489 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2013-08-02 10:12:03,489 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2013-08-02 10:12:03,610 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
>> 2013-08-02 10:12:03,612 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
>> 2013-08-02 10:12:03,619 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2013-08-02 10:12:03,611 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
>> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
>> 2013-08-02 10:12:03,638 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
>> 2013-08-02 10:12:04,598 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
>> 2013-08-02 10:12:04,631 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> 2013-08-02 10:12:04,635 INFO fetcher.Fetcher - -activeThreads=0
>> 2013-08-02 10:12:09,293 INFO fetcher.Fetcher - Fetcher: finished at 2013-08-02 10:12:09, elapsed: 00:00:07
>> 2013-08-02 10:12:09,296 INFO parse.ParseSegment - ParseSegment: starting at 2013-08-02 10:12:09
>> 2013-08-02 10:12:09,296 INFO parse.ParseSegment - ParseSegment: segment: crawl-0802-test-3/segments/20130802101154
>> 2013-08-02 10:12:10,335 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml, instantiating a new object cache
>> 2013-08-02 10:12:10,362 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
>> 2013-08-02 10:12:11,166 DEBUG parse.ParseUtil - Parsing [http://www.#####.com/] with [org.apache.nutch.parse.tika.TikaParser@4b3788e1]
>> 2013-08-02 10:12:11,168 DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.xml.DcXMLParser for mime-type application/xml
>> 2013-08-02 10:12:11,232 ERROR tika.TikaParser - Error parsing http://www.####.com/
>> org.apache.tika.exception.TikaException: XML parse error
>>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:96)
>>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
>>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 144; The entity name must immediately follow the '&' in the entity reference.
>>     at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
>>     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>>     at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>>     at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
>>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
>>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
>>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
>>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
>>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
>>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
>>     ... 8 more
>> 2013-08-02 10:12:11,246 WARN parse.ParseSegment - Error parsing: http://www.####.com/: failed(2,0): XML parse error
>> 2013-08-02 10:12:11,256 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
>> 2013-08-02 10:12:11,295 INFO parse.ParseSegment - Parsed (50ms): http://www.####.com/
>> 2013-08-02 10:12:12,701 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml, instantiating a new object cache
>> 2013-08-02 10:12:16,705 INFO parse.ParseSegment - ParseSegment: finished at 2013-08-02 10:12:16, elapsed: 00:00:07
>> 2013-08-02 10:12:16,709 INFO crawl.CrawlDb - CrawlDb update: starting at 2013-08-02 10:12:16
>> 2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: db: crawl-0802-test-3/crawldb
>> 2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: segments: [crawl-0802-test-3/segments/20130802101154]
>> 2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
>> 2013-08-02 10:12:16,712 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
>> 2013-08-02 10:12:16,712 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
>> 2013-08-02 10:12:16,713 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
>> 2013-08-02 10:12:16,713 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
>> 2013-08-02 10:12:17,579 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0007.xml, instantiating a new object cache
>> 2013-08-02 10:12:17,594 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
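[Editor's note] The ParserFactory warning in the log ("all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file") points at a related knob: Nutch's conf/parse-plugins.xml maps MIME types to parser plugins. A hedged sketch of what an explicit mapping for application/xml might look like; the element names follow the stock parse-plugins.xml shipped with Nutch 1.x, but verify against your own copy before editing:

```xml
<!-- conf/parse-plugins.xml (fragment, inside <parse-plugins>) -->
<!-- Route application/xml explicitly to a parser plugin so the
     ParserFactory warning goes away; "parse-tika" is the plugin id
     used by the stock Nutch 1.x distribution. -->
<mimeType name="application/xml">
  <plugin id="parse-tika" />
</mimeType>
```

The fix quoted at the top of the thread takes the other route: widening the contentType pattern of parse-html so the lenient HTML parser claims application/xml pages instead of Tika's strict XML parser.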
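[Editor's note] The underlying SAXParseException ("The entity name must immediately follow the '&' in the entity reference") usually means the fetched page is not well-formed XML: it contains a bare "&" in an attribute value or text node, where XML requires "&amp;". Because the site was served (or sniffed) as application/xml, Tika routed it to a strict XML parser instead of the lenient HTML parser, and the strict parser rejected it. A minimal sketch of the same failure using Python's stdlib XML parser (the URL-like attribute value here is made up; Xerces in the stack trace above rejects the same input for the same reason, with slightly different wording):

```python
import xml.etree.ElementTree as ET

# A bare '&' in an attribute value is not well-formed XML:
# the parser expects an entity reference (e.g. &amp;) after '&'.
bad = '<link href="page?a=1&b=2"/>'
err = None
try:
    ET.fromstring(bad)
except ET.ParseError as e:
    err = e  # parse fails, just like Xerces does in the Nutch log

# Escaping the ampersand makes the document well-formed.
good = '<link href="page?a=1&amp;b=2"/>'
elem = ET.fromstring(good)
# The parser hands back the unescaped value: page?a=1&b=2
```

This is why routing such pages to the HTML parser (which tolerates bare ampersands) fixes the crawl, as described in the reply above.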

