While running Nutch 1.6, I cannot crawl one particular site: it fails during
the parsing stage with the error message below. I tried to google this
issue, and I tried changing parse.timeout to 3600 and even to -1, but it
doesn't seem to make any difference.
Please help.


Error message: Error parsing http://www.####.com/ failed(2,0): XML parse
error

From the logs:

2013-08-02 10:12:03,446 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-08-02 10:12:03,465 INFO  http.Http - http.proxy.host = null
2013-08-02 10:12:03,466 INFO  http.Http - http.proxy.port = 8080
2013-08-02 10:12:03,466 INFO  http.Http - http.timeout = 240000
2013-08-02 10:12:03,466 INFO  http.Http - http.content.limit = -1
2013-08-02 10:12:03,466 INFO  http.Http - http.agent = Nutch
Spider/Nutch-1.6
2013-08-02 10:12:03,466 INFO  http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2013-08-02 10:12:03,466 INFO  http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2013-08-02 10:12:03,472 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-08-02 10:12:03,473 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-08-02 10:12:03,476 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-08-02 10:12:03,489 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-08-02 10:12:03,489 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-08-02 10:12:03,610 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=3
2013-08-02 10:12:03,612 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=2
2013-08-02 10:12:03,619 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-08-02 10:12:03,611 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Using queue mode : byHost
2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Fetcher: throughput
threshold: -1
2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Fetcher: throughput
threshold retries: 5
2013-08-02 10:12:03,638 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2013-08-02 10:12:04,598 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2013-08-02 10:12:04,631 INFO  fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2013-08-02 10:12:04,635 INFO  fetcher.Fetcher - -activeThreads=0
2013-08-02 10:12:09,293 INFO  fetcher.Fetcher - Fetcher: finished at
2013-08-02 10:12:09, elapsed: 00:00:07
2013-08-02 10:12:09,296 INFO  parse.ParseSegment - ParseSegment: starting
at 2013-08-02 10:12:09
2013-08-02 10:12:09,296 INFO  parse.ParseSegment - ParseSegment: segment:
crawl-0802-test-3/segments/20130802101154
2013-08-02 10:12:10,335 DEBUG util.ObjectCache - No object cache found for
conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml,
mapred-site.xml,
file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml,
instantiating a new object cache
2013-08-02 10:12:10,362 INFO  parse.ParserFactory - The parsing plugins:
[org.apache.nutch.parse.tika.TikaParser] are enabled via the
plugin.includes system property, and all claim to support the content type
application/xml, but they are not mapped to it  in the parse-plugins.xml
file
2013-08-02 10:12:11,166 DEBUG parse.ParseUtil - Parsing [
http://www.#####.com/] with [org.apache.nutch.parse.tika.TikaParser@4b3788e1
]
2013-08-02 10:12:11,168 DEBUG tika.TikaParser - Using Tika parser
org.apache.tika.parser.xml.DcXMLParser for mime-type application/xml
2013-08-02 10:12:11,232 ERROR tika.TikaParser - Error parsing
http://www.####.com/
org.apache.tika.exception.TikaException: XML parse error
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:96)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber:
144; The entity name must immediately follow the '&' in the entity
reference.
    at
org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown
Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
    at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
    at
org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown
Source)
    at
org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown
Source)
    at
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
Source)
    at
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown
Source)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
    ... 8 more
2013-08-02 10:12:11,246 WARN  parse.ParseSegment - Error parsing:
http://www.####.com/:
failed(2,0): XML parse error
2013-08-02 10:12:11,256 INFO  crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2013-08-02 10:12:11,295 INFO  parse.ParseSegment - Parsed (50ms):
http://www.####.com/
2013-08-02 10:12:12,701 DEBUG util.ObjectCache - No object cache found for
conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml,
mapred-site.xml,
file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml,
instantiating a new object cache
2013-08-02 10:12:16,705 INFO  parse.ParseSegment - ParseSegment: finished
at 2013-08-02 10:12:16, elapsed: 00:00:07
2013-08-02 10:12:16,709 INFO  crawl.CrawlDb - CrawlDb update: starting at
2013-08-02 10:12:16
2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: db:
crawl-0802-test-3/crawldb
2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: segments:
[crawl-0802-test-3/segments/20130802101154]
2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: additions
allowed: true
2013-08-02 10:12:16,712 INFO  crawl.CrawlDb - CrawlDb update: URL
normalizing: true
2013-08-02 10:12:16,712 INFO  crawl.CrawlDb - CrawlDb update: URL
filtering: true
2013-08-02 10:12:16,713 INFO  crawl.CrawlDb - CrawlDb update: 404 purging:
false
2013-08-02 10:12:16,713 INFO  crawl.CrawlDb - CrawlDb update: Merging
segment data into db.
2013-08-02 10:12:17,579 DEBUG util.ObjectCache - No object cache found for
conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml,
mapred-site.xml,
file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0007.xml,
instantiating a new object cache
2013-08-02 10:12:17,594 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'crawldb', using default
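Side note on the ParserFactory warning above: it says TikaParser claims to
support application/xml but is not mapped to it in parse-plugins.xml. If you
want to silence that warning, a mapping along these lines (assuming the stock
conf/parse-plugins.xml layout) should do it, though it will not fix the
underlying malformed-XML error:

```xml
<!-- conf/parse-plugins.xml: map application/xml to the Tika parser -->
<mimeType name="application/xml">
    <plugin id="parse-tika" />
</mimeType>
```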
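For what it's worth, the "Caused by" line (SAXParseException: "The entity name
must immediately follow the '&'") suggests the page is being served as
application/xml but contains a bare '&' somewhere around line 18, which a
strict XML parser must reject even though HTML parsers tolerate it. A minimal
Java sketch (class name and sample strings are mine, not from Nutch) that
reproduces the same failure mode with the JDK's own SAX parser:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class EntityCheck {
    // Returns true when the input parses as strict, well-formed XML.
    static boolean isWellFormedXml(String doc) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(doc.getBytes("UTF-8")),
                       new DefaultHandler());
            return true;
        } catch (Exception e) {
            // e.g. SAXParseException: The entity name must immediately
            // follow the '&' in the entity reference.
            return false;
        }
    }

    public static void main(String[] args) {
        // A bare '&' is tolerated by HTML parsers but fatal in XML:
        System.out.println(isWellFormedXml("<a>Fish & Chips</a>"));     // false
        System.out.println(isWellFormedXml("<a>Fish &amp; Chips</a>")); // true
    }
}
```

Timeout settings will not help here, since the parser fails fast on the
malformed markup rather than hanging.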
