Hi Laxmi, I see that the mimeType of http://www.####.com/ is application/xml, so the parse-html plugin thinks it cannot parse that content, and parse-tika is used to parse it as XML instead; but the content is actually HTML. So I don't think this is a Nutch issue. You can also add a mimeType mapping in conf/parse-plugins.xml.
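For example, a mapping along these lines in conf/parse-plugins.xml should route application/xml to parse-html instead of Tika (this is a sketch against a stock Nutch 1.6 file; the plugin must also claim the content type in its own plugin.xml for the mapping to take effect):

```xml
<!-- Hypothetical entry for conf/parse-plugins.xml: map application/xml
     to the parse-html plugin so it is chosen ahead of parse-tika. -->
<mimeType name="application/xml">
  <plugin id="parse-html" />
</mimeType>
```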
On Sat, Aug 3, 2013 at 3:23 AM, A Laxmi <[email protected]> wrote:

> All I did was add "application/xml" in the file plugins/parse-html/plugin.xml.
>
> On Fri, Aug 2, 2013 at 3:22 PM, A Laxmi <[email protected]> wrote:
>
> > I could solve my issue. I am not sure if this was fixed in 1.7 or not, but
> > with Nutch 1.6, all I did was add "application/xml" in the file
> > plugins/parse-html/plugin.xml:
> > <parameter name="contentType" value="text/html|application/xhtml+xml|application/xml" />
> > That fixed my issue. Hopefully it will help someone with the same problem.
> >
> > On Fri, Aug 2, 2013 at 10:48 AM, A Laxmi <[email protected]> wrote:
> >
> >> With Nutch 1.6, I could not crawl one particular site; it gives me the
> >> following error message in the parsing stage. I tried to google this issue,
> >> I tried changing parse.timeout to 3600, and I even tried changing it to -1,
> >> but it does not seem to make any difference. Please help.
> >>
> >> Error message: Error parsing http://www.####.com/ failed(2,0): XML parse error
> >>
> >> From the logs:
> >>
> >> 2013-08-02 10:12:03,446 INFO fetcher.Fetcher - Using queue mode : byHost
> >> 2013-08-02 10:12:03,465 INFO http.Http - http.proxy.host = null
> >> 2013-08-02 10:12:03,466 INFO http.Http - http.proxy.port = 8080
> >> 2013-08-02 10:12:03,466 INFO http.Http - http.timeout = 240000
> >> 2013-08-02 10:12:03,466 INFO http.Http - http.content.limit = -1
> >> 2013-08-02 10:12:03,466 INFO http.Http - http.agent = Nutch Spider/Nutch-1.6
> >> 2013-08-02 10:12:03,466 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> >> 2013-08-02 10:12:03,466 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> >> 2013-08-02 10:12:03,472 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> >> 2013-08-02 10:12:03,473 INFO fetcher.Fetcher - Using queue mode : byHost
> >> 2013-08-02 10:12:03,476 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> >> 2013-08-02 10:12:03,489 INFO fetcher.Fetcher - Using queue mode : byHost
> >> 2013-08-02 10:12:03,489 INFO fetcher.Fetcher - Using queue mode : byHost
> >> 2013-08-02 10:12:03,610 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
> >> 2013-08-02 10:12:03,612 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
> >> 2013-08-02 10:12:03,619 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> >> 2013-08-02 10:12:03,611 INFO fetcher.Fetcher - Using queue mode : byHost
> >> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Using queue mode : byHost
> >> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> >> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
> >> 2013-08-02 10:12:03,623 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
> >> 2013-08-02 10:12:03,638 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> >> 2013-08-02 10:12:04,598 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> >> 2013-08-02 10:12:04,631 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >> 2013-08-02 10:12:04,635 INFO fetcher.Fetcher - -activeThreads=0
> >> 2013-08-02 10:12:09,293 INFO fetcher.Fetcher - Fetcher: finished at 2013-08-02 10:12:09, elapsed: 00:00:07
> >> 2013-08-02 10:12:09,296 INFO parse.ParseSegment - ParseSegment: starting at 2013-08-02 10:12:09
> >> 2013-08-02 10:12:09,296 INFO parse.ParseSegment - ParseSegment: segment: crawl-0802-test-3/segments/20130802101154
> >> 2013-08-02 10:12:10,335 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml, instantiating a new object cache
> >> 2013-08-02 10:12:10,362 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
> >> 2013-08-02 10:12:11,166 DEBUG parse.ParseUtil - Parsing [http://www.#####.com/] with [org.apache.nutch.parse.tika.TikaParser@4b3788e1]
> >> 2013-08-02 10:12:11,168 DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.xml.DcXMLParser for mime-type application/xml
> >> 2013-08-02 10:12:11,232 ERROR tika.TikaParser - Error parsing http://www.####.com/
> >> org.apache.tika.exception.TikaException: XML parse error
> >>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
> >>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:96)
> >>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
> >>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
> >>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
> >>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
> >>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> >>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> >> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 144; The entity name must immediately follow the '&' in the entity reference.
> >>     at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
> >>     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
> >>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> >>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> >>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> >>     at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
> >>     at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
> >>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
> >>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
> >>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
> >>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> >>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> >>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> >>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> >>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> >>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> >>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
> >>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
> >>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
> >>     ... 8 more
> >> 2013-08-02 10:12:11,246 WARN parse.ParseSegment - Error parsing: http://www.####.com/: failed(2,0): XML parse error
> >> 2013-08-02 10:12:11,256 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> >> 2013-08-02 10:12:11,295 INFO parse.ParseSegment - Parsed (50ms): http://www.####.com/
> >> 2013-08-02 10:12:12,701 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0006.xml, instantiating a new object cache
> >> 2013-08-02 10:12:16,705 INFO parse.ParseSegment - ParseSegment: finished at 2013-08-02 10:12:16, elapsed: 00:00:07
> >> 2013-08-02 10:12:16,709 INFO crawl.CrawlDb - CrawlDb update: starting at 2013-08-02 10:12:16
> >> 2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: db: crawl-0802-test-3/crawldb
> >> 2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: segments: [crawl-0802-test-3/segments/20130802101154]
> >> 2013-08-02 10:12:16,711 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
> >> 2013-08-02 10:12:16,712 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
> >> 2013-08-02 10:12:16,712 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
> >> 2013-08-02 10:12:16,713 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
> >> 2013-08-02 10:12:16,713 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
> >> 2013-08-02 10:12:17,579 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0007.xml, instantiating a new object cache
> >> 2013-08-02 10:12:17,594 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> >>
> >> --
> >> Don't Grow Old, Grow Up... :-)
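For reference, the plugin.xml change described in the quoted fix would look roughly like this (the surrounding implementation element is sketched from a stock Nutch 1.6 layout and may differ slightly in your checkout; only the contentType value changes):

```xml
<!-- plugins/parse-html/plugin.xml: extend the contentType pattern so
     parse-html also claims pages labelled application/xml that are
     really HTML. The ids below are assumed from a standard install. -->
<implementation id="org.apache.nutch.parse.html.HtmlParser"
                class="org.apache.nutch.parse.html.HtmlParser">
  <parameter name="contentType"
             value="text/html|application/xhtml+xml|application/xml"/>
</implementation>
```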

