Hi Laxmi

I see that http://www.####.com/ has mimeType application/xml, so the
parse-html plugin thinks it cannot parse that content, and Nutch falls back
to parse-tika to parse it as XML. But the content is actually HTML, so I
don't think this is a bug. You can also map that mimeType to parse-html in
conf/parse-plugins.xml.
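For example, a mapping in conf/parse-plugins.xml could look roughly like
this (a sketch following the element names in the stock parse-plugins.xml;
check the file shipped with your Nutch version):

```xml
<!-- sketch only: route application/xml to the parse-html plugin,
     using the plugin id as it appears in the stock conf/parse-plugins.xml -->
<mimeType name="application/xml">
  <plugin id="parse-html" />
</mimeType>
```

Note the plugin must also claim the content type in its own plugin.xml for
the mapping to take effect.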



On Sat, Aug 3, 2013 at 3:23 AM, A Laxmi <[email protected]> wrote:

> all I did was add "application/xml" in file plugins/parse-html/plugin.xml
>
>
> On Fri, Aug 2, 2013 at 3:22 PM, A Laxmi <[email protected]> wrote:
>
> > I could solve my issue. I am not sure if this was fixed in 1.7 or not.
> > But with Nutch 1.6, all I did was add "application/xml" in
> > plugins/parse-html/plugin.xml -> <parameter name="contentType"
> > value="text/html|application/xhtml+xml|application/xml" />. That fixed
> > my issue. Hopefully it will help someone with the same problem.
> >
> >
> > On Fri, Aug 2, 2013 at 10:48 AM, A Laxmi <[email protected]> wrote:
> >
> >> With Nutch 1.6, I could not crawl one particular site; it gives me the
> >> following error message in the parsing stage. I tried to Google this
> >> issue, and I tried changing parse.timeout to 3600 and even to -1, but
> >> it doesn't seem to make any difference.
> >> Please help.
> >>
> >>
> >> Error message: Error parsing http://www.####.com/ failed(2,0): XML
> parse
> >> error
> >>
> >> From the logs:
> >>
> >> 2013-08-02 10:12:03,446 INFO  fetcher.Fetcher - Using queue mode :
> byHost
> >> 2013-08-02 10:12:03,465 INFO  http.Http - http.proxy.host = null
> >> 2013-08-02 10:12:03,466 INFO  http.Http - http.proxy.port = 8080
> >> 2013-08-02 10:12:03,466 INFO  http.Http - http.timeout = 240000
> >> 2013-08-02 10:12:03,466 INFO  http.Http - http.content.limit = -1
> >> 2013-08-02 10:12:03,466 INFO  http.Http - http.agent = Nutch
> >> Spider/Nutch-1.6
> >> 2013-08-02 10:12:03,466 INFO  http.Http - http.accept.language =
> >> en-us,en-gb,en;q=0.7,*;q=0.3
> >> 2013-08-02 10:12:03,466 INFO  http.Http - http.accept =
> >> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> >> 2013-08-02 10:12:03,472 INFO  fetcher.Fetcher - -finishing thread
> >> FetcherThread, activeThreads=1
> >> 2013-08-02 10:12:03,473 INFO  fetcher.Fetcher - Using queue mode :
> byHost
> >> 2013-08-02 10:12:03,476 INFO  fetcher.Fetcher - -finishing thread
> >> FetcherThread, activeThreads=1
> >> 2013-08-02 10:12:03,489 INFO  fetcher.Fetcher - Using queue mode :
> byHost
> >> 2013-08-02 10:12:03,489 INFO  fetcher.Fetcher - Using queue mode :
> byHost
> >> 2013-08-02 10:12:03,610 INFO  fetcher.Fetcher - -finishing thread
> >> FetcherThread, activeThreads=3
> >> 2013-08-02 10:12:03,612 INFO  fetcher.Fetcher - -finishing thread
> >> FetcherThread, activeThreads=2
> >> 2013-08-02 10:12:03,619 INFO  fetcher.Fetcher - -finishing thread
> >> FetcherThread, activeThreads=1
> >> 2013-08-02 10:12:03,611 INFO  fetcher.Fetcher - Using queue mode :
> byHost
> >> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Using queue mode :
> byHost
> >> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - -finishing thread
> >> FetcherThread, activeThreads=1
> >> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Fetcher: throughput
> >> threshold: -1
> >> 2013-08-02 10:12:03,623 INFO  fetcher.Fetcher - Fetcher: throughput
> >> threshold retries: 5
> >> 2013-08-02 10:12:03,638 INFO  fetcher.Fetcher - -finishing thread
> >> FetcherThread, activeThreads=1
> >> 2013-08-02 10:12:04,598 INFO  fetcher.Fetcher - -finishing thread
> >> FetcherThread, activeThreads=0
> >> 2013-08-02 10:12:04,631 INFO  fetcher.Fetcher - -activeThreads=0,
> >> spinWaiting=0, fetchQueues.totalSize=0
> >> 2013-08-02 10:12:04,635 INFO  fetcher.Fetcher - -activeThreads=0
> >> 2013-08-02 10:12:09,293 INFO  fetcher.Fetcher - Fetcher: finished at
> >> 2013-08-02 10:12:09, elapsed: 00:00:07
> >> 2013-08-02 10:12:09,296 INFO  parse.ParseSegment - ParseSegment:
> starting
> >> at 2013-08-02 10:12:09
> >> 2013-08-02 10:12:09,296 INFO  parse.ParseSegment - ParseSegment:
> segment:
> >> crawl-0802-test-3/segments/20130802101154
> >> 2013-08-02 10:12:10,335 DEBUG util.ObjectCache - No object cache found
> >> for conf=Configuration: core-default.xml, core-site.xml,
> >> mapred-default.xml, mapred-site.xml,
> >> file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml,
> >> instantiating a new object cache
> >> 2013-08-02 10:12:10,362 INFO  parse.ParserFactory - The parsing plugins:
> >> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> >> plugin.includes system property, and all claim to support the content
> type
> >> application/xml, but they are not mapped to it  in the parse-plugins.xml
> >> file
> >> 2013-08-02 10:12:11,166 DEBUG parse.ParseUtil - Parsing [
> >> http://www.#####.com/] with
> >> [org.apache.nutch.parse.tika.TikaParser@4b3788e1]
> >> *2013-08-02 10:12:11,168 DEBUG tika.TikaParser - Using Tika parser
> >> org.apache.tika.parser.xml.DcXMLParser for mime-type application/xml*
> >> *2013-08-02 10:12:11,232 ERROR tika.TikaParser - Error parsing
> >> http://www.####.com/
> >> org.apache.tika.exception.TikaException: XML parse error*
> >>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
> >>     at
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:96)
> >>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
> >>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
> >>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
> >>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
> >>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> >>     at
> >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> >> *Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber:
> >> 144; The entity name must immediately follow the '&' in the entity
> >> reference.*
> >>     at
> >>
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown
> >> Source)
> >>     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown
> >> Source)
> >>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown
> Source)
> >>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown
> Source)
> >>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown
> Source)
> >>     at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown
> Source)
> >>     at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown
> >> Source)
> >>     at
> >> org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown
> >> Source)
> >>     at
> >> org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown
> >> Source)
> >>     at
> >>
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
> >> Source)
> >>     at
> >>
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
> >> Source)
> >>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown
> Source)
> >>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown
> Source)
> >>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> >>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> >>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown
> >> Source)
> >>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
> >>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
> >>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
> >>     ... 8 more
> >> 2013-08-02 10:12:11,246 WARN  parse.ParseSegment -* Error parsing:
> http://www.####.com/:
> >> failed(2,0): XML parse error*
> >> 2013-08-02 10:12:11,256 INFO  crawl.SignatureFactory - Using Signature
> >> impl: org.apache.nutch.crawl.MD5Signature
> >> 2013-08-02 10:12:11,295 INFO  parse.ParseSegment - Parsed (50ms):
> >> http://www.####.com/
> >> 2013-08-02 10:12:12,701 DEBUG util.ObjectCache - No object cache found
> >> for conf=Configuration: core-default.xml, core-site.xml,
> >> mapred-default.xml, mapred-site.xml,
> >> file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml,
> >> instantiating a new object cache
> >> 2013-08-02 10:12:16,705 INFO  parse.ParseSegment - ParseSegment:
> finished
> >> at 2013-08-02 10:12:16, elapsed: 00:00:07
> >> 2013-08-02 10:12:16,709 INFO  crawl.CrawlDb - CrawlDb update: starting
> at
> >> 2013-08-02 10:12:16
> >> 2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: db:
> >> crawl-0802-test-3/crawldb
> >> 2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: segments:
> >> [crawl-0802-test-3/segments/20130802101154]
> >> 2013-08-02 10:12:16,711 INFO  crawl.CrawlDb - CrawlDb update: additions
> >> allowed: true
> >> 2013-08-02 10:12:16,712 INFO  crawl.CrawlDb - CrawlDb update: URL
> >> normalizing: true
> >> 2013-08-02 10:12:16,712 INFO  crawl.CrawlDb - CrawlDb update: URL
> >> filtering: true
> >> 2013-08-02 10:12:16,713 INFO  crawl.CrawlDb - CrawlDb update: 404
> >> purging: false
> >> 2013-08-02 10:12:16,713 INFO  crawl.CrawlDb - CrawlDb update: Merging
> >> segment data into db.
> >> 2013-08-02 10:12:17,579 DEBUG util.ObjectCache - No object cache found
> >> for conf=Configuration: core-default.xml, core-site.xml,
> >> mapred-default.xml, mapred-site.xml,
> >> file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0007.xml,
> >> instantiating a new object cache
> >> 2013-08-02 10:12:17,594 INFO  regex.RegexURLNormalizer - can't find
> rules
> >> for scope 'crawldb', using default
> >>
> >>
> >
>
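For anyone following the fix quoted above: the relevant block in
plugins/parse-html/plugin.xml ends up looking roughly like this (sketched
from a stock Nutch plugin.xml; surrounding attributes may differ slightly
across versions):

```xml
<!-- sketch only: the contentType parameter of the parse-html plugin,
     extended so the plugin also claims application/xml -->
<implementation id="org.apache.nutch.parse.html.HtmlParser"
                class="org.apache.nutch.parse.html.HtmlParser">
  <parameter name="contentType"
             value="text/html|application/xhtml+xml|application/xml"/>
</implementation>
```

Keep in mind this makes parse-html claim all application/xml content, real
XML feeds included, so it is a workaround rather than a general fix.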



-- 
Don't Grow Old, Grow Up... :-)