With Nutch 1.6, I cannot crawl one particular site: the crawl fails in the parsing stage with the error message below. I searched for this issue and tried changing parse.timeout to 3600, and even to -1, but neither makes any difference. Please help.
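For reference, a timeout override of this kind goes in the conf/nutch-site.xml overlay. A sketch of what that override would look like, assuming the property name parser.timeout used by nutch-default.xml in Nutch 1.x (the value is in seconds; -1 disables the timeout). Note that if the override uses a property name that nutch-default.xml does not define, Nutch silently ignores it:

```xml
<!-- conf/nutch-site.xml: overrides values from nutch-default.xml -->
<configuration>
  <property>
    <!-- Parser timeout in seconds; -1 disables the timeout entirely -->
    <name>parser.timeout</name>
    <value>-1</value>
  </property>
</configuration>
```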
Error message:

    Error parsing http://www.####.com/ failed(2,0): XML parse error

From the logs:

    2013-08-02 10:12:03,446 INFO fetcher.Fetcher - Using queue mode : byHost
    2013-08-02 10:12:03,465 INFO http.Http - http.proxy.host = null
    2013-08-02 10:12:03,466 INFO http.Http - http.proxy.port = 8080
    2013-08-02 10:12:03,466 INFO http.Http - http.timeout = 240000
    2013-08-02 10:12:03,466 INFO http.Http - http.content.limit = -1
    2013-08-02 10:12:03,466 INFO http.Http - http.agent = Nutch Spider/Nutch-1.6
    2013-08-02 10:12:03,466 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
    2013-08-02 10:12:03,466 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    2013-08-02 10:12:03,472 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
    2013-08-02 10:12:03,473 INFO fetcher.Fetcher - Using queue mode : byHost
    2013-08-02 10:12:03,476 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
    2013-08-02 10:12:03,489 INFO fetcher.Fetcher - Using queue mode : byHost
    2013-08-02 10:12:03,489 INFO fetcher.Fetcher - Using queue mode : byHost
    2013-08-02 10:12:03,610 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
    2013-08-02 10:12:03,612 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
    2013-08-02 10:12:03,619 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
    2013-08-02 10:12:03,611 INFO fetcher.Fetcher - Using queue mode : byHost
    2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Using queue mode : byHost
    2013-08-02 10:12:03,623 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
    2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
    2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
    2013-08-02 10:12:03,638 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
    2013-08-02 10:12:04,598 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
    2013-08-02 10:12:04,631 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
    2013-08-02 10:12:04,635 INFO fetcher.Fetcher - -activeThreads=0
    2013-08-02 10:12:09,293 INFO fetcher.Fetcher - Fetcher: finished at 2013-08-02 10:12:09, elapsed: 00:00:07
    2013-08-02 10:12:09,296 INFO parse.ParseSegment - ParseSegment: starting at 2013-08-02 10:12:09
    2013-08-02 10:12:09,296 INFO parse.ParseSegment - ParseSegment: segment: crawl-0802-test-3/segments/20130802101154
    2013-08-02 10:12:10,335 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml, instantiating a new object cache
    2013-08-02 10:12:10,362 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
    2013-08-02 10:12:11,166 DEBUG parse.ParseUtil - Parsing [http://www.#####.com/] with [org.apache.nutch.parse.tika.TikaParser@4b3788e1]
    2013-08-02 10:12:11,168 DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.xml.DcXMLParser for mime-type application/xml
    2013-08-02 10:12:11,232 ERROR tika.TikaParser - Error parsing http://www.####.com/
    org.apache.tika.exception.TikaException: XML parse error
        at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:96)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
    Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 144; The entity name must immediately follow the '&' in the entity reference.
        at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
        at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
        at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
        at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
        at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
        ... 8 more
    2013-08-02 10:12:11,246 WARN parse.ParseSegment - Error parsing: http://www.####.com/: failed(2,0): XML parse error
    2013-08-02 10:12:11,256 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
    2013-08-02 10:12:11,295 INFO parse.ParseSegment - Parsed (50ms): http://www.####.com/
    2013-08-02 10:12:12,701 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml, instantiating a new object cache
    2013-08-02 10:12:16,705 INFO parse.ParseSegment - ParseSegment: finished at 2013-08-02 10:12:16, elapsed: 00:00:07
    2013-08-02 10:12:16,709 INFO crawl.CrawlDb - CrawlDb update: starting at 2013-08-02 10:12:16
    2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: db: crawl-0802-test-3/crawldb
    2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: segments: [crawl-0802-test-3/segments/20130802101154]
    2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
    2013-08-02 10:12:16,712 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
    2013-08-02 10:12:16,712 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
    2013-08-02 10:12:16,713 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
    2013-08-02 10:12:16,713 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
    2013-08-02 10:12:17,579 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0007.xml, instantiating a new object cache
    2013-08-02 10:12:17,594 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
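Separately, the ParserFactory warning says application/xml is not explicitly mapped in parse-plugins.xml. A sketch of the mapping one could add to conf/parse-plugins.xml, assuming the standard parse-tika alias id that file already uses for other mime types:

```xml
<!-- conf/parse-plugins.xml: map application/xml explicitly to the Tika parser -->
<mimeType name="application/xml">
    <plugin id="parse-tika" />
</mimeType>
```

This would only silence the warning, though; the parse would still fail as long as the fetched document itself is ill-formed XML.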
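For what it's worth, the root cause in the trace is not the timeout at all: it is a SAXParseException at line 18, column 144 of the fetched document, meaning a bare '&' in an attribute value that is not part of an entity reference. Because the server reports Content-Type application/xml, Tika picks its strict XML parser (DcXMLParser), and strict XML parsers reject such a document as ill-formed. A minimal reproduction of that failure mode with Python's xml.sax (a hypothetical snippet, not the site's actual markup):

```python
import xml.sax

# A raw '&' inside an attribute value is ill-formed XML;
# it must be escaped as '&amp;' to form a valid entity reference.
bad = '<a href="http://example.com/?a=1&b=2"/>'
good = '<a href="http://example.com/?a=1&amp;b=2"/>'

def parses(doc: str) -> bool:
    """Return True if doc is well-formed XML, False otherwise."""
    try:
        xml.sax.parseString(doc.encode("utf-8"), xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False

print(parses(bad))   # False: "not well-formed" at the '&'
print(parses(good))  # True
```

An HTML parser would tolerate the bad input, which is why pages like this often render fine in a browser while failing a strict XML parse.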

