I was able to solve my issue. I am not sure whether this was fixed in 1.7, but with Nutch 1.6 all I did was add "application/xml" to the contentType parameter in plugins/parse-html/plugin.xml:

    <parameter name="contentType" value="text/html|application/xhtml+xml|application/xml" />

That fixed my issue. Hopefully it helps someone with the same problem.
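An alternative fix suggested by the warning in the logs ("...but they are not mapped to it in the parse-plugins.xml file") would be to map application/xml to a parser explicitly in conf/parse-plugins.xml. A rough sketch, assuming you want parse-html to handle that type (the mimeType/plugin element shape follows the existing entries in that file; adjust to your setup):

```xml
<!-- conf/parse-plugins.xml (sketch): map application/xml to parse-html
     so Nutch does not fall back to Tika's DcXMLParser for this type. -->
<mimeType name="application/xml">
    <plugin id="parse-html" />
</mimeType>
```

Note that the contentType value in parse-html's plugin.xml still has to include application/xml (as above), otherwise the plugin won't claim the type at all.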
On Fri, Aug 2, 2013 at 10:48 AM, A Laxmi <[email protected]> wrote:

> With Nutch 1.6, I could not crawl one particular site; it gives me the
> following error message in the parsing stage. I tried to google this
> issue, and I tried changing parse.timeout to 3600 and even to -1, but it
> doesn't seem to make any difference. Please help.
>
> Error message: Error parsing http://www.####.com/ failed(2,0): XML parse error
>
> From the logs:
>
> 2013-08-02 10:12:03,446 INFO fetcher.Fetcher - Using queue mode : byHost
> 2013-08-02 10:12:03,465 INFO http.Http - http.proxy.host = null
> 2013-08-02 10:12:03,466 INFO http.Http - http.proxy.port = 8080
> 2013-08-02 10:12:03,466 INFO http.Http - http.timeout = 240000
> 2013-08-02 10:12:03,466 INFO http.Http - http.content.limit = -1
> 2013-08-02 10:12:03,466 INFO http.Http - http.agent = Nutch Spider/Nutch-1.6
> 2013-08-02 10:12:03,466 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2013-08-02 10:12:03,466 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2013-08-02 10:12:03,472 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2013-08-02 10:12:03,473 INFO fetcher.Fetcher - Using queue mode : byHost
> 2013-08-02 10:12:03,476 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2013-08-02 10:12:03,489 INFO fetcher.Fetcher - Using queue mode : byHost
> 2013-08-02 10:12:03,489 INFO fetcher.Fetcher - Using queue mode : byHost
> 2013-08-02 10:12:03,610 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
> 2013-08-02 10:12:03,612 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
> 2013-08-02 10:12:03,619 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2013-08-02 10:12:03,611 INFO fetcher.Fetcher - Using queue mode : byHost
> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Using queue mode : byHost
> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
> 2013-08-02 10:12:03,638 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2013-08-02 10:12:04,598 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> 2013-08-02 10:12:04,631 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> 2013-08-02 10:12:04,635 INFO fetcher.Fetcher - -activeThreads=0
> 2013-08-02 10:12:09,293 INFO fetcher.Fetcher - Fetcher: finished at 2013-08-02 10:12:09, elapsed: 00:00:07
> 2013-08-02 10:12:09,296 INFO parse.ParseSegment - ParseSegment: starting at 2013-08-02 10:12:09
> 2013-08-02 10:12:09,296 INFO parse.ParseSegment - ParseSegment: segment: crawl-0802-test-3/segments/20130802101154
> 2013-08-02 10:12:10,335 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml, instantiating a new object cache
> 2013-08-02 10:12:10,362 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
> 2013-08-02 10:12:11,166 DEBUG parse.ParseUtil - Parsing [http://www.#####.com/] with [org.apache.nutch.parse.tika.TikaParser@4b3788e1]
> *2013-08-02 10:12:11,168 DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.xml.DcXMLParser for mime-type application/xml*
> *2013-08-02 10:12:11,232 ERROR tika.TikaParser - Error parsing http://www.####.com/
> org.apache.tika.exception.TikaException: XML parse error*
>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:96)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> *Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 144; The entity name must immediately follow the '&' in the entity reference.*
>     at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
>     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>     at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>     at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
>     ... 8 more
> 2013-08-02 10:12:11,246 WARN parse.ParseSegment - *Error parsing: http://www.####.com/: failed(2,0): XML parse error*
> 2013-08-02 10:12:11,256 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 2013-08-02 10:12:11,295 INFO parse.ParseSegment - Parsed (50ms): http://www.####.com/
> 2013-08-02 10:12:12,701 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml, instantiating a new object cache
> 2013-08-02 10:12:16,705 INFO parse.ParseSegment - ParseSegment: finished at 2013-08-02 10:12:16, elapsed: 00:00:07
> 2013-08-02 10:12:16,709 INFO crawl.CrawlDb - CrawlDb update: starting at 2013-08-02 10:12:16
> 2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: db: crawl-0802-test-3/crawldb
> 2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: segments: [crawl-0802-test-3/segments/20130802101154]
> 2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
> 2013-08-02 10:12:16,712 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
> 2013-08-02 10:12:16,712 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
> 2013-08-02 10:12:16,713 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
> 2013-08-02 10:12:16,713 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
> 2013-08-02 10:12:17,579 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0007.xml, instantiating a new object cache
> 2013-08-02 10:12:17,594 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default

