Hi, I'm running Nutch 2 on a 2-node Hadoop cluster to do whole-web crawling. I'm seeding about 700 URLs from the DMOZ directory, and roughly that many are injected (the injector reports 765 injected and 14 rejected by filters). The problem is that nothing is generated after the inject phase, so nothing is fetched, parsed, or indexed either.
The trace of the entire crawl job is here:

14/05/28 06:54:23 INFO crawl.InjectorJob: InjectorJob: starting at 2014-05-28 06:54:23 14/05/28 06:54:23 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls/seed.txt 14/05/28 06:54:24 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class. 14/05/28 06:54:25 INFO input.FileInputFormat: Total input paths to process : 1 14/05/28 06:54:25 INFO util.NativeCodeLoader: Loaded the native-hadoop library 14/05/28 06:54:25 WARN snappy.LoadSnappy: Snappy native library not loaded 14/05/28 06:54:25 INFO mapred.JobClient: Running job: job_201405280024_0015 14/05/28 06:54:26 INFO mapred.JobClient: map 0% reduce 0% 14/05/28 06:54:36 INFO mapred.JobClient: map 100% reduce 0% 14/05/28 06:54:40 INFO mapred.JobClient: Job complete: job_201405280024_0015 14/05/28 06:54:40 INFO mapred.JobClient: Counters: 20 14/05/28 06:54:40 INFO mapred.JobClient: Job Counters 14/05/28 06:54:40 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=10927 14/05/28 06:54:40 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/05/28 06:54:40 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/05/28 06:54:40 INFO mapred.JobClient: Launched map tasks=1 14/05/28 06:54:40 INFO mapred.JobClient: Data-local map tasks=1 14/05/28 06:54:40 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 14/05/28 06:54:40 INFO mapred.JobClient: File Output Format Counters 14/05/28 06:54:40 INFO mapred.JobClient: Bytes Written=0 14/05/28 06:54:40 INFO mapred.JobClient: injector 14/05/28 06:54:40 INFO mapred.JobClient: urls_injected=765 14/05/28 06:54:40 INFO mapred.JobClient: urls_filtered=14 14/05/28 06:54:40 INFO mapred.JobClient: FileSystemCounters 14/05/28 06:54:40 INFO mapred.JobClient: HDFS_BYTES_READ=26006 14/05/28 06:54:40 INFO mapred.JobClient: FILE_BYTES_WRITTEN=77762 14/05/28 06:54:40 INFO mapred.JobClient: File Input Format Counters 14/05/28
06:54:40 INFO mapred.JobClient: Bytes Read=25896 14/05/28 06:54:40 INFO mapred.JobClient: Map-Reduce Framework 14/05/28 06:54:40 INFO mapred.JobClient: Map input records=779 14/05/28 06:54:40 INFO mapred.JobClient: Physical memory (bytes) snapshot=113258496 14/05/28 06:54:40 INFO mapred.JobClient: Spilled Records=0 14/05/28 06:54:40 INFO mapred.JobClient: CPU time spent (ms)=2530 14/05/28 06:54:40 INFO mapred.JobClient: Total committed heap usage (bytes)=58195968 14/05/28 06:54:40 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1118162944 14/05/28 06:54:40 INFO mapred.JobClient: Map output records=765 14/05/28 06:54:40 INFO mapred.JobClient: SPLIT_RAW_BYTES=110 14/05/28 06:54:40 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 14 14/05/28 06:54:40 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 765 14/05/28 06:54:40 INFO crawl.InjectorJob: Injector: finished at 2014-05-28 06:54:40, elapsed: 00:00:16 Wed May 28 06:54:40 EDT 2014 : Iteration 1 of 2 Generating batchId Generating a new fetchlist Warning: $HADOOP_HOME is deprecated. 14/05/28 06:54:42 INFO crawl.GeneratorJob: GeneratorJob: starting at 2014-05-28 06:54:42 14/05/28 06:54:42 INFO crawl.GeneratorJob: GeneratorJob: Selecting best-scoring urls due for fetch. 
14/05/28 06:54:42 INFO crawl.GeneratorJob: GeneratorJob: starting 14/05/28 06:54:42 INFO crawl.GeneratorJob: GeneratorJob: filtering: false 14/05/28 06:54:42 INFO crawl.GeneratorJob: GeneratorJob: normalizing: false 14/05/28 06:54:42 INFO crawl.GeneratorJob: GeneratorJob: topN: 50000 14/05/28 06:54:42 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule 14/05/28 06:54:42 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000 14/05/28 06:54:42 INFO crawl.AbstractFetchSchedule: maxInterval=7776000 14/05/28 06:54:44 INFO mapred.JobClient: Running job: job_201405280024_0016 14/05/28 06:54:45 INFO mapred.JobClient: map 0% reduce 0% 14/05/28 06:54:55 INFO mapred.JobClient: map 100% reduce 0% 14/05/28 06:55:03 INFO mapred.JobClient: map 100% reduce 16% 14/05/28 06:55:04 INFO mapred.JobClient: map 100% reduce 50% 14/05/28 07:02:29 INFO mapred.JobClient: Task Id : attempt_201405280024_0016_r_000001_0, Status : FAILED Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. 
14/05/28 07:02:39 INFO mapred.JobClient: map 100% reduce 66% 14/05/28 07:02:40 INFO mapred.JobClient: map 100% reduce 100% 14/05/28 07:02:43 INFO mapred.JobClient: Job complete: job_201405280024_0016 14/05/28 07:02:43 INFO mapred.JobClient: Counters: 27 14/05/28 07:02:43 INFO mapred.JobClient: Job Counters 14/05/28 07:02:43 INFO mapred.JobClient: Launched reduce tasks=3 14/05/28 07:02:43 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=11387 14/05/28 07:02:43 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/05/28 07:02:43 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/05/28 07:02:43 INFO mapred.JobClient: Launched map tasks=1 14/05/28 07:02:43 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=23048 14/05/28 07:02:43 INFO mapred.JobClient: File Output Format Counters 14/05/28 07:02:43 INFO mapred.JobClient: Bytes Written=0 14/05/28 07:02:43 INFO mapred.JobClient: FileSystemCounters 14/05/28 07:02:43 INFO mapred.JobClient: FILE_BYTES_READ=44 14/05/28 07:02:43 INFO mapred.JobClient: HDFS_BYTES_READ=833 14/05/28 07:02:43 INFO mapred.JobClient: FILE_BYTES_WRITTEN=239555 14/05/28 07:02:43 INFO mapred.JobClient: File Input Format Counters 14/05/28 07:02:43 INFO mapred.JobClient: Bytes Read=0 14/05/28 07:02:43 INFO mapred.JobClient: Map-Reduce Framework 14/05/28 07:02:43 INFO mapred.JobClient: Map output materialized bytes=28 14/05/28 07:02:43 INFO mapred.JobClient: Map input records=0 14/05/28 07:02:43 INFO mapred.JobClient: Reduce shuffle bytes=28 14/05/28 07:02:43 INFO mapred.JobClient: Spilled Records=0 14/05/28 07:02:43 INFO mapred.JobClient: Map output bytes=0 14/05/28 07:02:43 INFO mapred.JobClient: Total committed heap usage (bytes)=277872640 14/05/28 07:02:43 INFO mapred.JobClient: CPU time spent (ms)=4130 14/05/28 07:02:43 INFO mapred.JobClient: Combine input records=0 14/05/28 07:02:43 INFO mapred.JobClient: SPLIT_RAW_BYTES=833 14/05/28 07:02:43 INFO mapred.JobClient: Reduce input 
records=0 14/05/28 07:02:43 INFO mapred.JobClient: Reduce input groups=0 14/05/28 07:02:43 INFO mapred.JobClient: Combine output records=0 14/05/28 07:02:43 INFO mapred.JobClient: Physical memory (bytes) snapshot=422510592 14/05/28 07:02:43 INFO mapred.JobClient: Reduce output records=0 14/05/28 07:02:43 INFO mapred.JobClient: Virtual memory (bytes) snapshot=5982715904 14/05/28 07:02:43 INFO mapred.JobClient: Map output records=0 14/05/28 07:02:43 INFO crawl.GeneratorJob: GeneratorJob: finished at 2014-05-28 07:02:43, time elapsed: 00:08:00 14/05/28 07:02:43 INFO crawl.GeneratorJob: GeneratorJob: generated batch id: 1401274480-22738 Fetching : Warning: $HADOOP_HOME is deprecated. 14/05/28 07:02:45 INFO fetcher.FetcherJob: FetcherJob: starting 14/05/28 07:02:45 INFO fetcher.FetcherJob: FetcherJob: batchId: 1401274480-22738 14/05/28 07:02:45 INFO fetcher.FetcherJob: FetcherJob: threads: 50 14/05/28 07:02:45 INFO fetcher.FetcherJob: FetcherJob: parsing: false 14/05/28 07:02:45 INFO fetcher.FetcherJob: FetcherJob: resuming: false 14/05/28 07:02:45 INFO fetcher.FetcherJob: FetcherJob : timelimit set for : 1401285765716 14/05/28 07:02:46 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar110933996696870181/classes/plugins 14/05/28 07:02:46 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true] 14/05/28 07:02:46 INFO plugin.PluginRepository: Registered Plugins: 14/05/28 07:02:46 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints) 14/05/28 07:02:46 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic) 14/05/28 07:02:46 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html) 14/05/28 07:02:46 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic) 14/05/28 07:02:46 INFO plugin.PluginRepository: Http / Https Protocol Plug-in (protocol-httpclient) 14/05/28 07:02:46 INFO plugin.PluginRepository: HTTP Framework (lib-http) 14/05/28 07:02:46 INFO plugin.PluginRepository: 
Creative Commons Plugins (creativecommons) 14/05/28 07:02:46 INFO plugin.PluginRepository: More Indexing Filter (index-more) 14/05/28 07:02:46 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex) 14/05/28 07:02:46 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass) 14/05/28 07:02:46 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex) 14/05/28 07:02:46 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic) 14/05/28 07:02:46 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml) 14/05/28 07:02:46 INFO plugin.PluginRepository: JavaScript Parser (parse-js) 14/05/28 07:02:46 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter) 14/05/28 07:02:46 INFO plugin.PluginRepository: Registered Extension-Points: 14/05/28 07:02:46 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 14/05/28 07:02:46 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol) 14/05/28 07:02:46 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter) 14/05/28 07:02:46 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter) 14/05/28 07:02:46 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 14/05/28 07:02:46 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser) 14/05/28 07:02:46 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter) 14/05/28 07:02:46 INFO httpclient.Http: http.proxy.host = null 14/05/28 07:02:46 INFO httpclient.Http: http.proxy.port = 8080 14/05/28 07:02:46 INFO httpclient.Http: http.timeout = 10000 14/05/28 07:02:46 INFO httpclient.Http: http.content.limit = 65536 14/05/28 07:02:46 INFO httpclient.Http: http.agent = Qontifi/Nutch-2.2.1 (A big data analytics and social media intelligence platform; http://qontifi.com; manikandan at thesocialpeople dot net) 14/05/28 07:02:46 
INFO httpclient.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3 14/05/28 07:02:46 INFO httpclient.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 14/05/28 07:02:46 INFO conf.Configuration: found resource httpclient-auth.xml at file:/app/hadoop/tmp/hadoop-unjar110933996696870181/httpclient-auth.xml 14/05/28 07:02:46 INFO httpclient.Http: http.proxy.host = null 14/05/28 07:02:46 INFO httpclient.Http: http.proxy.port = 8080 14/05/28 07:02:46 INFO httpclient.Http: http.timeout = 10000 14/05/28 07:02:46 INFO httpclient.Http: http.content.limit = 65536 14/05/28 07:02:46 INFO httpclient.Http: http.agent = Qontifi/Nutch-2.2.1 (A big data analytics and social media intelligence platform; http://qontifi.com; manikandan at thesocialpeople dot net) 14/05/28 07:02:46 INFO httpclient.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3 14/05/28 07:02:46 INFO httpclient.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 14/05/28 07:02:49 INFO mapred.JobClient: Running job: job_201405280024_0017 14/05/28 07:02:50 INFO mapred.JobClient: map 0% reduce 0% 14/05/28 07:03:01 INFO mapred.JobClient: map 100% reduce 0% 14/05/28 07:03:10 INFO mapred.JobClient: map 100% reduce 16% 14/05/28 07:03:13 INFO mapred.JobClient: map 100% reduce 50% 14/05/28 07:10:34 INFO mapred.JobClient: Task Id : attempt_201405280024_0017_r_000001_0, Status : FAILED Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. 
14/05/28 07:10:44 INFO mapred.JobClient: map 100% reduce 66% 14/05/28 07:10:47 INFO mapred.JobClient: map 100% reduce 100% 14/05/28 07:10:54 INFO mapred.JobClient: Job complete: job_201405280024_0017 14/05/28 07:10:54 INFO mapred.JobClient: Counters: 28 14/05/28 07:10:54 INFO mapred.JobClient: Job Counters 14/05/28 07:10:54 INFO mapred.JobClient: Launched reduce tasks=3 14/05/28 07:10:54 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=11752 14/05/28 07:10:54 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/05/28 07:10:54 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/05/28 07:10:54 INFO mapred.JobClient: Launched map tasks=1 14/05/28 07:10:54 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=33613 14/05/28 07:10:54 INFO mapred.JobClient: File Output Format Counters 14/05/28 07:10:54 INFO mapred.JobClient: Bytes Written=0 14/05/28 07:10:54 INFO mapred.JobClient: FileSystemCounters 14/05/28 07:10:54 INFO mapred.JobClient: FILE_BYTES_READ=44 14/05/28 07:10:54 INFO mapred.JobClient: HDFS_BYTES_READ=817 14/05/28 07:10:54 INFO mapred.JobClient: FILE_BYTES_WRITTEN=238025 14/05/28 07:10:54 INFO mapred.JobClient: File Input Format Counters 14/05/28 07:10:54 INFO mapred.JobClient: Bytes Read=0 14/05/28 07:10:54 INFO mapred.JobClient: FetcherStatus 14/05/28 07:10:54 INFO mapred.JobClient: HitByTimeLimit-QueueFeeder=0 14/05/28 07:10:54 INFO mapred.JobClient: Map-Reduce Framework 14/05/28 07:10:54 INFO mapred.JobClient: Map output materialized bytes=28 14/05/28 07:10:54 INFO mapred.JobClient: Map input records=0 14/05/28 07:10:54 INFO mapred.JobClient: Reduce shuffle bytes=28 14/05/28 07:10:54 INFO mapred.JobClient: Spilled Records=0 14/05/28 07:10:54 INFO mapred.JobClient: Map output bytes=0 14/05/28 07:10:54 INFO mapred.JobClient: Total committed heap usage (bytes)=317194240 14/05/28 07:10:54 INFO mapred.JobClient: CPU time spent (ms)=6460 14/05/28 07:10:54 INFO mapred.JobClient: Combine input 
records=0 14/05/28 07:10:54 INFO mapred.JobClient: SPLIT_RAW_BYTES=817 14/05/28 07:10:54 INFO mapred.JobClient: Reduce input records=0 14/05/28 07:10:54 INFO mapred.JobClient: Reduce input groups=0 14/05/28 07:10:54 INFO mapred.JobClient: Combine output records=0 14/05/28 07:10:54 INFO mapred.JobClient: Physical memory (bytes) snapshot=444006400 14/05/28 07:10:54 INFO mapred.JobClient: Reduce output records=0 14/05/28 07:10:54 INFO mapred.JobClient: Virtual memory (bytes) snapshot=6052544512 14/05/28 07:10:54 INFO mapred.JobClient: Map output records=0 14/05/28 07:10:54 INFO fetcher.FetcherJob: FetcherJob: done Parsing : Warning: $HADOOP_HOME is deprecated. 14/05/28 07:10:56 INFO parse.ParserJob: ParserJob: starting 14/05/28 07:10:56 INFO parse.ParserJob: ParserJob: resuming: false 14/05/28 07:10:56 INFO parse.ParserJob: ParserJob: forced reparse: false 14/05/28 07:10:56 INFO parse.ParserJob: ParserJob: batchId: 1401274480-22738 14/05/28 07:10:57 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar1161270060222812225/classes/plugins 14/05/28 07:10:57 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true] 14/05/28 07:10:57 INFO plugin.PluginRepository: Registered Plugins: 14/05/28 07:10:57 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints) 14/05/28 07:10:57 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic) 14/05/28 07:10:57 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html) 14/05/28 07:10:57 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic) 14/05/28 07:10:57 INFO plugin.PluginRepository: Http / Https Protocol Plug-in (protocol-httpclient) 14/05/28 07:10:57 INFO plugin.PluginRepository: HTTP Framework (lib-http) 14/05/28 07:10:57 INFO plugin.PluginRepository: Creative Commons Plugins (creativecommons) 14/05/28 07:10:57 INFO plugin.PluginRepository: More Indexing Filter (index-more) 14/05/28 07:10:57 INFO plugin.PluginRepository: Regex URL Filter 
(urlfilter-regex) 14/05/28 07:10:57 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass) 14/05/28 07:10:57 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex) 14/05/28 07:10:57 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic) 14/05/28 07:10:57 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml) 14/05/28 07:10:57 INFO plugin.PluginRepository: JavaScript Parser (parse-js) 14/05/28 07:10:57 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter) 14/05/28 07:10:57 INFO plugin.PluginRepository: Registered Extension-Points: 14/05/28 07:10:57 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 14/05/28 07:10:57 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol) 14/05/28 07:10:57 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter) 14/05/28 07:10:57 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter) 14/05/28 07:10:57 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 14/05/28 07:10:57 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser) 14/05/28 07:10:57 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter) 14/05/28 07:10:57 INFO conf.Configuration: found resource parse-plugins.xml at file:/app/hadoop/tmp/hadoop-unjar1161270060222812225/parse-plugins.xml 14/05/28 07:10:57 INFO crawl.SignatureFactory: Using Signature impl: org.apache.nutch.crawl.MD5Signature 14/05/28 07:10:59 INFO mapred.JobClient: Running job: job_201405280024_0018 14/05/28 07:11:00 INFO mapred.JobClient: map 0% reduce 0% 14/05/28 07:11:07 INFO mapred.JobClient: map 100% reduce 0% 14/05/28 07:11:09 INFO mapred.JobClient: Job complete: job_201405280024_0018 14/05/28 07:11:09 INFO mapred.JobClient: Counters: 17 14/05/28 07:11:09 INFO mapred.JobClient: Job Counters 14/05/28 07:11:09 INFO 
mapred.JobClient: SLOTS_MILLIS_MAPS=7869 14/05/28 07:11:09 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/05/28 07:11:09 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/05/28 07:11:09 INFO mapred.JobClient: Launched map tasks=1 14/05/28 07:11:09 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 14/05/28 07:11:09 INFO mapred.JobClient: File Output Format Counters 14/05/28 07:11:09 INFO mapred.JobClient: Bytes Written=0 14/05/28 07:11:09 INFO mapred.JobClient: FileSystemCounters 14/05/28 07:11:09 INFO mapred.JobClient: HDFS_BYTES_READ=861 14/05/28 07:11:09 INFO mapred.JobClient: FILE_BYTES_WRITTEN=78891 14/05/28 07:11:09 INFO mapred.JobClient: File Input Format Counters 14/05/28 07:11:09 INFO mapred.JobClient: Bytes Read=0 14/05/28 07:11:09 INFO mapred.JobClient: Map-Reduce Framework 14/05/28 07:11:09 INFO mapred.JobClient: Map input records=0 14/05/28 07:11:09 INFO mapred.JobClient: Physical memory (bytes) snapshot=114253824 14/05/28 07:11:09 INFO mapred.JobClient: Spilled Records=0 14/05/28 07:11:09 INFO mapred.JobClient: CPU time spent (ms)=1070 14/05/28 07:11:09 INFO mapred.JobClient: Total committed heap usage (bytes)=58195968 14/05/28 07:11:09 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1987776512 14/05/28 07:11:09 INFO mapred.JobClient: Map output records=0 14/05/28 07:11:09 INFO mapred.JobClient: SPLIT_RAW_BYTES=861 14/05/28 07:11:09 INFO parse.ParserJob: ParserJob: success CrawlDB update for TestCrawl Warning: $HADOOP_HOME is deprecated. 
14/05/28 07:11:12 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting 14/05/28 07:11:13 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar5400634919722418143/classes/plugins 14/05/28 07:11:13 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true] 14/05/28 07:11:13 INFO plugin.PluginRepository: Registered Plugins: 14/05/28 07:11:13 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints) 14/05/28 07:11:13 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic) 14/05/28 07:11:13 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html) 14/05/28 07:11:13 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic) 14/05/28 07:11:13 INFO plugin.PluginRepository: Http / Https Protocol Plug-in (protocol-httpclient) 14/05/28 07:11:13 INFO plugin.PluginRepository: HTTP Framework (lib-http) 14/05/28 07:11:13 INFO plugin.PluginRepository: Creative Commons Plugins (creativecommons) 14/05/28 07:11:13 INFO plugin.PluginRepository: More Indexing Filter (index-more) 14/05/28 07:11:13 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex) 14/05/28 07:11:13 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass) 14/05/28 07:11:13 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex) 14/05/28 07:11:13 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic) 14/05/28 07:11:13 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml) 14/05/28 07:11:13 INFO plugin.PluginRepository: JavaScript Parser (parse-js) 14/05/28 07:11:13 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter) 14/05/28 07:11:13 INFO plugin.PluginRepository: Registered Extension-Points: 14/05/28 07:11:13 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 14/05/28 07:11:13 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol) 14/05/28 07:11:13 INFO plugin.PluginRepository: 
Parse Filter (org.apache.nutch.parse.ParseFilter) 14/05/28 07:11:13 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter) 14/05/28 07:11:13 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 14/05/28 07:11:13 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser) 14/05/28 07:11:13 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter) 14/05/28 07:11:16 INFO mapred.JobClient: Running job: job_201405280024_0019 14/05/28 07:11:17 INFO mapred.JobClient: map 0% reduce 0% 14/05/28 07:11:28 INFO mapred.JobClient: map 100% reduce 0% 14/05/28 07:11:38 INFO mapred.JobClient: map 100% reduce 16% 14/05/28 07:11:39 INFO mapred.JobClient: map 100% reduce 50% 14/05/28 07:19:00 INFO mapred.JobClient: Task Id : attempt_201405280024_0019_r_000001_0, Status : FAILED Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. 14/05/28 07:19:00 WARN mapred.JobClient: Error reading task outputnutch-two-qontifi 14/05/28 07:19:00 WARN mapred.JobClient: Error reading task outputnutch-two-qontifi 14/05/28 07:19:11 INFO mapred.JobClient: map 100% reduce 66% 14/05/28 07:19:12 INFO mapred.JobClient: map 100% reduce 100% 14/05/28 07:19:13 INFO mapred.JobClient: Job complete: job_201405280024_0019 14/05/28 07:19:13 INFO mapred.JobClient: Counters: 27 14/05/28 07:19:13 INFO mapred.JobClient: Job Counters 14/05/28 07:19:13 INFO mapred.JobClient: Launched reduce tasks=3 14/05/28 07:19:13 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=10614 14/05/28 07:19:13 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/05/28 07:19:13 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/05/28 07:19:13 INFO mapred.JobClient: Launched map tasks=1 14/05/28 07:19:13 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=23263 14/05/28 07:19:13 INFO mapred.JobClient: File Output Format Counters 14/05/28 07:19:13 INFO 
mapred.JobClient: Bytes Written=0 14/05/28 07:19:13 INFO mapred.JobClient: FileSystemCounters 14/05/28 07:19:13 INFO mapred.JobClient: FILE_BYTES_READ=44 14/05/28 07:19:13 INFO mapred.JobClient: HDFS_BYTES_READ=910 14/05/28 07:19:13 INFO mapred.JobClient: FILE_BYTES_WRITTEN=238016 14/05/28 07:19:13 INFO mapred.JobClient: File Input Format Counters 14/05/28 07:19:13 INFO mapred.JobClient: Bytes Read=0 14/05/28 07:19:13 INFO mapred.JobClient: Map-Reduce Framework 14/05/28 07:19:13 INFO mapred.JobClient: Map output materialized bytes=28 14/05/28 07:19:13 INFO mapred.JobClient: Map input records=0 14/05/28 07:19:13 INFO mapred.JobClient: Reduce shuffle bytes=28 14/05/28 07:19:13 INFO mapred.JobClient: Spilled Records=0 14/05/28 07:19:13 INFO mapred.JobClient: Map output bytes=0 14/05/28 07:19:13 INFO mapred.JobClient: Total committed heap usage (bytes)=293601280 14/05/28 07:19:13 INFO mapred.JobClient: CPU time spent (ms)=6540 14/05/28 07:19:13 INFO mapred.JobClient: Combine input records=0 14/05/28 07:19:13 INFO mapred.JobClient: SPLIT_RAW_BYTES=910 14/05/28 07:19:13 INFO mapred.JobClient: Reduce input records=0 14/05/28 07:19:13 INFO mapred.JobClient: Reduce input groups=0 14/05/28 07:19:13 INFO mapred.JobClient: Combine output records=0 14/05/28 07:19:13 INFO mapred.JobClient: Physical memory (bytes) snapshot=470159360 14/05/28 07:19:13 INFO mapred.JobClient: Reduce output records=0 14/05/28 07:19:13 INFO mapred.JobClient: Virtual memory (bytes) snapshot=5987823616 14/05/28 07:19:13 INFO mapred.JobClient: Map output records=0 14/05/28 07:19:13 INFO crawl.DbUpdaterJob: DbUpdaterJob: done Indexing TestCrawl on SOLR index -> http://128.199.207.54:8983/solr/nutch Warning: $HADOOP_HOME is deprecated. 
14/05/28 07:19:16 INFO solr.SolrIndexerJob: SolrIndexerJob: starting 14/05/28 07:19:16 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar5241938989393377870/classes/plugins 14/05/28 07:19:16 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true] 14/05/28 07:19:16 INFO plugin.PluginRepository: Registered Plugins: 14/05/28 07:19:16 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints) 14/05/28 07:19:16 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic) 14/05/28 07:19:16 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html) 14/05/28 07:19:16 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic) 14/05/28 07:19:16 INFO plugin.PluginRepository: Http / Https Protocol Plug-in (protocol-httpclient) 14/05/28 07:19:16 INFO plugin.PluginRepository: HTTP Framework (lib-http) 14/05/28 07:19:16 INFO plugin.PluginRepository: Creative Commons Plugins (creativecommons) 14/05/28 07:19:16 INFO plugin.PluginRepository: More Indexing Filter (index-more) 14/05/28 07:19:16 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex) 14/05/28 07:19:16 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass) 14/05/28 07:19:16 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex) 14/05/28 07:19:16 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic) 14/05/28 07:19:16 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml) 14/05/28 07:19:16 INFO plugin.PluginRepository: JavaScript Parser (parse-js) 14/05/28 07:19:16 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter) 14/05/28 07:19:16 INFO plugin.PluginRepository: Registered Extension-Points: 14/05/28 07:19:16 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 14/05/28 07:19:16 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol) 14/05/28 07:19:16 INFO 
plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter) 14/05/28 07:19:16 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter) 14/05/28 07:19:16 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 14/05/28 07:19:16 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser) 14/05/28 07:19:16 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter) 14/05/28 07:19:16 INFO basic.BasicIndexingFilter: Maximum title length for indexing set to: 100 14/05/28 07:19:16 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.basic.BasicIndexingFilter 14/05/28 07:19:16 INFO indexer.IndexingFilters: Adding org.creativecommons.nutch.CCIndexingFilter 14/05/28 07:19:17 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.more.MoreIndexingFilter 14/05/28 07:19:21 INFO mapred.JobClient: Running job: job_201405280024_0020 14/05/28 07:19:22 INFO mapred.JobClient: map 0% reduce 0% 14/05/28 07:19:31 INFO mapred.JobClient: map 100% reduce 0% 14/05/28 07:19:33 INFO mapred.JobClient: Job complete: job_201405280024_0020 14/05/28 07:19:33 INFO mapred.JobClient: Counters: 17 14/05/28 07:19:33 INFO mapred.JobClient: Job Counters 14/05/28 07:19:33 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=9290 14/05/28 07:19:33 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/05/28 07:19:33 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/05/28 07:19:33 INFO mapred.JobClient: Launched map tasks=1 14/05/28 07:19:33 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 14/05/28 07:19:33 INFO mapred.JobClient: File Output Format Counters 14/05/28 07:19:33 INFO mapred.JobClient: Bytes Written=0 14/05/28 07:19:33 INFO mapred.JobClient: FileSystemCounters 14/05/28 07:19:33 INFO mapred.JobClient: HDFS_BYTES_READ=877 14/05/28 07:19:33 INFO mapred.JobClient: FILE_BYTES_WRITTEN=79006 
14/05/28 07:19:33 INFO mapred.JobClient: File Input Format Counters 14/05/28 07:19:33 INFO mapred.JobClient: Bytes Read=0 14/05/28 07:19:33 INFO mapred.JobClient: Map-Reduce Framework 14/05/28 07:19:33 INFO mapred.JobClient: Map input records=0 14/05/28 07:19:33 INFO mapred.JobClient: Physical memory (bytes) snapshot=117587968 14/05/28 07:19:33 INFO mapred.JobClient: Spilled Records=0 14/05/28 07:19:33 INFO mapred.JobClient: CPU time spent (ms)=1040 14/05/28 07:19:33 INFO mapred.JobClient: Total committed heap usage (bytes)=59768832 14/05/28 07:19:33 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1992785920 14/05/28 07:19:33 INFO mapred.JobClient: Map output records=0 14/05/28 07:19:33 INFO mapred.JobClient: SPLIT_RAW_BYTES=877 14/05/28 07:19:33 INFO solr.SolrIndexerJob: SolrIndexerJob: done.

Am I missing anything?

--
Manikandan Saravanan
Architect - Technology
TheSocialPeople

