All I did was add "application/xml" to the contentType parameter in plugins/parse-html/plugin.xml.
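For reference, the changed parameter would look roughly like this (a sketch based on the stock parse-html plugin descriptor in Nutch 1.6; verify the exact default value against your own plugin.xml):

```xml
<!-- plugins/parse-html/plugin.xml: extend the contentType list so the
     HTML parser also claims application/xml (sketch; the pipe-separated
     value format follows the existing descriptor) -->
<parameter name="contentType"
           value="text/html|application/xhtml+xml|application/xml"/>
```

Note that the earlier log warning ("all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file") suggests the corresponding mapping in conf/parse-plugins.xml may also need updating; that file pairs mime types with parser plugin ids.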


On Fri, Aug 2, 2013 at 3:22 PM, A Laxmi <[email protected]> wrote:

> I could solve my issue. I am not sure whether this was fixed in 1.7 or not,
> but with Nutch 1.6, all I did was add "application/xml" to the contentType
> parameter in plugins/parse-html/plugin.xml:
> <parameter name="contentType"
> value="text/html|application/xhtml+xml|application/xml"/>. That fixed
> my issue. Hopefully it will help someone with the same problem.
>
>
> On Fri, Aug 2, 2013 at 10:48 AM, A Laxmi <[email protected]> wrote:
>
>> With Nutch 1.6, I could not crawl one particular site; it gives me the
>> following error message in the parsing stage. I tried to google this
>> issue, I tried changing parse.timeout to 3600, and I even tried changing
>> it to -1, but it doesn't seem to make any difference.
>> Please help.
>>
>>
>> Error message: Error parsing http://www.####.com/ failed(2,0): XML parse
>> error
>>
>> From the logs:
>>
>> 2013-08-02 10:12:03,446 INFO  fetcher.Fetcher - Using queue mode : byHost
>> 2013-08-02 10:12:03,465 INFO  http.Http - http.proxy.host = null
>> 2013-08-02 10:12:03,466 INFO  http.Http - http.proxy.port = 8080
>> 2013-08-02 10:12:03,466 INFO  http.Http - http.timeout = 240000
>> 2013-08-02 10:12:03,466 INFO  http.Http - http.content.limit = -1
>> 2013-08-02 10:12:03,466 INFO  http.Http - http.agent = Nutch
>> Spider/Nutch-1.6
>> 2013-08-02 10:12:03,466 INFO  http.Http - http.accept.language =
>> en-us,en-gb,en;q=0.7,*;q=0.3
>> 2013-08-02 10:12:03,466 INFO  http.Http - http.accept =
>> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>> 2013-08-02 10:12:03,472 INFO  fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2013-08-02 10:12:03,473 INFO  fetcher.Fetcher - Using queue mode : byHost
>> 2013-08-02 10:12:03,476 INFO  fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2013-08-02 10:12:03,489 INFO  fetcher.Fetcher - Using queue mode : byHost
>> 2013-08-02 10:12:03,489 INFO  fetcher.Fetcher - Using queue mode : byHost
>> 2013-08-02 10:12:03,610 INFO  fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=3
>> 2013-08-02 10:12:03,612 INFO  fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=2
>> 2013-08-02 10:12:03,619 INFO  fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2013-08-02 10:12:03,611 INFO  fetcher.Fetcher - Using queue mode : byHost
>> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Using queue mode : byHost
>> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Fetcher: throughput
>> threshold: -1
>> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Fetcher: throughput
>> threshold retries: 5
>> 2013-08-02 10:12:03,638 INFO  fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2013-08-02 10:12:04,598 INFO  fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=0
>> 2013-08-02 10:12:04,631 INFO  fetcher.Fetcher - -activeThreads=0,
>> spinWaiting=0, fetchQueues.totalSize=0
>> 2013-08-02 10:12:04,635 INFO  fetcher.Fetcher - -activeThreads=0
>> 2013-08-02 10:12:09,293 INFO  fetcher.Fetcher - Fetcher: finished at
>> 2013-08-02 10:12:09, elapsed: 00:00:07
>> 2013-08-02 10:12:09,296 INFO  parse.ParseSegment - ParseSegment: starting
>> at 2013-08-02 10:12:09
>> 2013-08-02 10:12:09,296 INFO  parse.ParseSegment - ParseSegment: segment:
>> crawl-0802-test-3/segments/20130802101154
>> 2013-08-02 10:12:10,335 DEBUG util.ObjectCache - No object cache found
>> for conf=Configuration: core-default.xml, core-site.xml,
>> mapred-default.xml, mapred-site.xml,
>> file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml,
>> instantiating a new object cache
>> 2013-08-02 10:12:10,362 INFO  parse.ParserFactory - The parsing plugins:
>> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
>> plugin.includes system property, and all claim to support the content type
>> application/xml, but they are not mapped to it  in the parse-plugins.xml
>> file
>> 2013-08-02 10:12:11,166 DEBUG parse.ParseUtil - Parsing [
>> http://www.#####.com/] with
>> [org.apache.nutch.parse.tika.TikaParser@4b3788e1]
>> 2013-08-02 10:12:11,168 DEBUG tika.TikaParser - Using Tika parser
>> org.apache.tika.parser.xml.DcXMLParser for mime-type application/xml
>> 2013-08-02 10:12:11,232 ERROR tika.TikaParser - Error parsing
>> http://www.####.com/
>> org.apache.tika.exception.TikaException: XML parse error
>>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:96)
>>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
>>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>     at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber:
>> 144; The entity name must immediately follow the '&' in the entity
>> reference.
>>     at
>> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown
>> Source)
>>     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown
>> Source)
>>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>>     at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>>     at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown
>> Source)
>>     at
>> org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown
>> Source)
>>     at
>> org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown
>> Source)
>>     at
>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
>> Source)
>>     at
>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
>> Source)
>>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown
>> Source)
>>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
>>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
>>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
>>     ... 8 more
>> 2013-08-02 10:12:11,246 WARN  parse.ParseSegment - Error parsing:
>> http://www.####.com/:
>> failed(2,0): XML parse error
>> 2013-08-02 10:12:11,256 INFO  crawl.SignatureFactory - Using Signature
>> impl: org.apache.nutch.crawl.MD5Signature
>> 2013-08-02 10:12:11,295 INFO  parse.ParseSegment - Parsed (50ms):
>> http://www.####.com/
>> 2013-08-02 10:12:12,701 DEBUG util.ObjectCache - No object cache found
>> for conf=Configuration: core-default.xml, core-site.xml,
>> mapred-default.xml, mapred-site.xml,
>> file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml,
>> instantiating a new object cache
>> 2013-08-02 10:12:16,705 INFO  parse.ParseSegment - ParseSegment: finished
>> at 2013-08-02 10:12:16, elapsed: 00:00:07
>> 2013-08-02 10:12:16,709 INFO  crawl.CrawlDb - CrawlDb update: starting at
>> 2013-08-02 10:12:16
>> 2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: db:
>> crawl-0802-test-3/crawldb
>> 2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: segments:
>> [crawl-0802-test-3/segments/20130802101154]
>> 2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: additions
>> allowed: true
>> 2013-08-02 10:12:16,712 INFO  crawl.CrawlDb - CrawlDb update: URL
>> normalizing: true
>> 2013-08-02 10:12:16,712 INFO  crawl.CrawlDb - CrawlDb update: URL
>> filtering: true
>> 2013-08-02 10:12:16,713 INFO  crawl.CrawlDb - CrawlDb update: 404
>> purging: false
>> 2013-08-02 10:12:16,713 INFO  crawl.CrawlDb - CrawlDb update: Merging
>> segment data into db.
>> 2013-08-02 10:12:17,579 DEBUG util.ObjectCache - No object cache found
>> for conf=Configuration: core-default.xml, core-site.xml,
>> mapred-default.xml, mapred-site.xml,
>> file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0007.xml,
>> instantiating a new object cache
>> 2013-08-02 10:12:17,594 INFO  regex.RegexURLNormalizer - can't find rules
>> for scope 'crawldb', using default
>>
>>
>
