[ https://issues.apache.org/jira/browse/NUTCH-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-852.
-------------------------------

Closing all resolved issues with a non-fixed status.

> parser not found for contentType=application/xhtml+xml
> ------------------------------------------------------
>
> Key: NUTCH-852
> URL: https://issues.apache.org/jira/browse/NUTCH-852
> Project: Nutch
> Issue Type: Bug
> Environment: Windows XP SP3, Cygwin
> Reporter: Pham Tuan Minh
> Assignee: Julien Nioche
> Fix For: 2.0
>
>
> I configured Nutch trunk to crawl a sample site (http://www.lucidimagination.com/)
> and post the results to a Solr server for indexing, but I got the following error.
> It seems the Tika parser is not working properly, or the Tika libraries are not recognized.
> ----------------------
> $ bin/nutch-local crawl urls -solr http://127.0.0.1:8983/solr/ -dir crawl -depth 3 -topN 50
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> solrUrl=http://127.0.0.1:8983/solr/
> topN = 50
> Injector: starting at 2010-07-14 02:08:20
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2010-07-14 02:08:31, elapsed: 00:00:11
> Generator: starting at 2010-07-14 02:08:32
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20100714020838
> Generator: finished at 2010-07-14 02:08:42, elapsed: 00:00:10
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2010-07-14 02:08:42
> Fetcher: segment: crawl/segments/20100714020838
> Fetcher: threads: 10
> QueueFeeder finished: total 1 records + hit by time limit :0
> fetching http://www.lucidimagination.com/
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=9
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> Error parsing: http://www.lucidimagination.com/:
> org.apache.nutch.parse.ParseException: parser not found for contentType=application/xhtml+xml url=http://www.lucidimagination.com/
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:647)
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2010-07-14 02:08:54, elapsed: 00:00:12
> CrawlDb update: starting at 2010-07-14 02:08:54
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20100714020838]
> CrawlDb update: additions allowed: true
> $
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2010-07-14 02:09:01, elapsed: 00:00:07
> $
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting at 2010-07-14 02:09:06
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714014136
> LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714015544
> LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714020206
> LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714020232
> LinkDb: adding segment: file:/D:/work/workspace/nutch/runtime/local/crawl/segments/20100714020838
> LinkDb: merging with existing linkdb: crawl/linkdb
> LinkDb: finished at 2010-07-14 02:09:19, elapsed: 00:00:12
> SolrIndexer: starting at 2010-07-14 02:09:19
> SolrIndexer: finished at 2010-07-14 02:09:36, elapsed: 00:00:17
> SolrDeleteDuplicates: starting at 2010-07-14 02:09:41
> SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
> SolrDeleteDuplicates: finished at 2010-07-14 02:09:45, elapsed: 00:00:04
> crawl finished: crawl
> ----------------------
> Thanks

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira