I just tried it and can confirm that index-basic has no effect on the crawl
db. Below is the log output from the injector and the crawl db reader. Apart
from the core extension points, only two plugins are registered: protocol-http
and lib-http. After injection the crawldb has 1 entry, which is the same URL
as in my seed list.


2011-10-14 15:30:03,683 INFO  crawl.Injector - Injector: starting at 2011-10-14 15:30:03
2011-10-14 15:30:03,684 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2011-10-14 15:30:03,684 INFO  crawl.Injector - Injector: urlDir: urls
2011-10-14 15:30:03,684 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2011-10-14 15:30:04,041 INFO  plugin.PluginRepository - Plugins: looking in: /home/markus/projects/apache/nutch/trunk/runtime/local/plugins
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository - Registered Plugins:
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository - Registered Extension-Points:
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-10-14 15:30:04,132 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-10-14 15:30:04,132 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-10-14 15:30:04,132 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2011-10-14 15:30:04,946 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
2011-10-14 15:30:05,160 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2011-10-14 15:30:06,104 INFO  crawl.Injector - Injector: finished at 2011-10-14 15:30:06, elapsed: 00:00:02
2011-10-14 15:30:08,727 INFO  crawl.CrawlDbReader - CrawlDb statistics start: crawl/crawldb/
2011-10-14 15:30:08,836 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - Statistics for CrawlDb: crawl/crawldb/
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - TOTAL urls: 1
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - retry 0:    1
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - min score:  1.0
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - avg score:  1.0
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - max score:  1.0
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - status 1 (db_unfetched):    1
2011-10-14 15:30:10,053 INFO  crawl.CrawlDbReader - CrawlDb statistics: done
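
For anyone repeating this check with different plugin lists, the counters can
be pulled out of such a log automatically instead of being read by eye. This
is only a rough sketch: the function name crawldb_stats and the SAMPLE text
are made up for illustration, and it assumes one log record per line (as in
the raw log file, not a mail-wrapped copy).

```python
import re

# Matches CrawlDbReader records of the form
#   <timestamp> INFO  crawl.CrawlDbReader - <key>: <value>
# Assumption: each record sits on a single line (not wrapped by a mail client).
STAT_LINE = re.compile(
    r"crawl\.CrawlDbReader - (?P<key>[^:]+):\s*(?P<value>\S+)\s*$"
)


def crawldb_stats(log_text):
    """Collect the 'key: value' pairs logged by CrawlDbReader into a dict.

    Values are kept as strings; lines from other loggers are ignored.
    """
    stats = {}
    for line in log_text.splitlines():
        match = STAT_LINE.search(line)
        if match:
            stats[match.group("key").strip()] = match.group("value")
    return stats


# Three records copied from the log above, used as sample input.
SAMPLE = """\
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - TOTAL urls: 1
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - min score:  1.0
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - status 1 (db_unfetched):    1
"""

stats = crawldb_stats(SAMPLE)
print(stats["TOTAL urls"])
```

Comparing stats["TOTAL urls"] between a run with index-basic and a run without
it would show directly whether the plugin list changes what ends up in the
crawldb.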



On Friday 14 October 2011 15:23:00 Radim Kolar wrote:
> try it yourself. in 1.4 remove index-basic from list of included
> plugins, then run nutch inject in hadoop mode and you will get 0 rows on
> first map output.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
