Which version of Nutch are you using? Which 2.x release, exactly?
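Independently of the version, the console output can show where records stop flowing between jobs. Below is a minimal Python sketch, not part of Nutch, that pulls the `Map input records` and `Map output records` counters for each job out of a saved log. It assumes the log is plain text with one entry per line (no `>` quoting); the name `job_record_counts` is made up for illustration.

```python
import re

# Matches the line Hadoop 1.x prints when a job is submitted.
JOB_RE = re.compile(r"Running job: (job_\d+_\d+)")
# Matches the two record counters of interest in the JobClient summary.
COUNTER_RE = re.compile(
    r"mapred\.JobClient:\s+(Map input records|Map output records)=(\d+)"
)


def job_record_counts(log_text):
    """Return {job_id: {counter_name: value}} for each job seen in the log."""
    counts = {}
    current = None
    for line in log_text.splitlines():
        match = JOB_RE.search(line)
        if match:
            # A new job starts; later counters belong to it.
            current = match.group(1)
            counts[current] = {}
            continue
        match = COUNTER_RE.search(line)
        if match and current is not None:
            counts[current][match.group(1)] = int(match.group(2))
    return counts


# Six entries copied from the quoted console output (one entry per line).
sample = """\
14/06/05 15:00:44 INFO mapred.JobClient: Running job: job_201406051410_0011
14/06/05 15:01:02 INFO mapred.JobClient:     Map input records=1
14/06/05 15:01:02 INFO mapred.JobClient:     Map output records=1
14/06/05 15:01:15 INFO mapred.JobClient: Running job: job_201406051410_0012
14/06/05 15:02:14 INFO mapred.JobClient:     Map input records=0
14/06/05 15:02:14 INFO mapred.JobClient:     Map output records=0
"""

for job_id, counters in job_record_counts(sample).items():
    print(job_id, counters)
```

On the sample, the inject job reads and writes one record while the next job (the generator) reads none, so the generator's view of the datastore is the first thing to look at.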
On Thu, Jun 5, 2014 at 12:14 PM, Manikandan Saravanan <[email protected]> wrote:
> Dear Lewis,
>
> I’m running Nutch 2 on a Hadoop 1.2.1 cluster (2 nodes). I’m using
> Cassandra as my backend datastore. I’m trying to crawl one link as of now.
> The inject command works properly: I’m able to find one row added to the
> “webpage” keyspace in Cassandra. But the generator doesn’t do a thing.
> Neither does the fetcher. In the end, nothing’s indexed in Solr.
>
> Please help me out. My stack trace is:
>
> hduser@nutch-one-qontifi:/usr/local/nutch$ bin/crawl urls/seed.txt TestCrawl http://10.130.231.16:8983/solr/nutch 2
> Warning: $HADOOP_HOME is deprecated.
>
> 14/06/05 15:00:34 INFO crawl.InjectorJob: InjectorJob: starting at 2014-06-05 15:00:34
> 14/06/05 15:00:34 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls/seed.txt
> 14/06/05 15:00:36 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
> 14/06/05 15:00:40 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
> 14/06/05 15:00:41 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as the Gora storage class.
> 14/06/05 15:00:44 INFO input.FileInputFormat: Total input paths to process > : 1 > 14/06/05 15:00:44 INFO util.NativeCodeLoader: Loaded the native-hadoop > library > 14/06/05 15:00:44 WARN snappy.LoadSnappy: Snappy native library not loaded > 14/06/05 15:00:44 INFO mapred.JobClient: Running job: job_201406051410_0011 > 14/06/05 15:00:45 INFO mapred.JobClient: map 0% reduce 0% > 14/06/05 15:01:00 INFO mapred.JobClient: map 100% reduce 0% > 14/06/05 15:01:02 INFO mapred.JobClient: Job complete: > job_201406051410_0011 > 14/06/05 15:01:02 INFO mapred.JobClient: Counters: 19 > 14/06/05 15:01:02 INFO mapred.JobClient: Job Counters > 14/06/05 15:01:02 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=14861 > 14/06/05 15:01:02 INFO mapred.JobClient: Total time spent by all > reduces waiting after reserving slots (ms)=0 > 14/06/05 15:01:02 INFO mapred.JobClient: Total time spent by all maps > waiting after reserving slots (ms)=0 > 14/06/05 15:01:02 INFO mapred.JobClient: Launched map tasks=1 > 14/06/05 15:01:02 INFO mapred.JobClient: Data-local map tasks=1 > 14/06/05 15:01:02 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 > 14/06/05 15:01:02 INFO mapred.JobClient: File Output Format Counters > 14/06/05 15:01:02 INFO mapred.JobClient: Bytes Written=0 > 14/06/05 15:01:02 INFO mapred.JobClient: injector > 14/06/05 15:01:02 INFO mapred.JobClient: urls_injected=1 > 14/06/05 15:01:02 INFO mapred.JobClient: FileSystemCounters > 14/06/05 15:01:02 INFO mapred.JobClient: HDFS_BYTES_READ=135 > 14/06/05 15:01:02 INFO mapred.JobClient: FILE_BYTES_WRITTEN=77648 > 14/06/05 15:01:02 INFO mapred.JobClient: File Input Format Counters > 14/06/05 15:01:02 INFO mapred.JobClient: Bytes Read=25 > 14/06/05 15:01:02 INFO mapred.JobClient: Map-Reduce Framework > 14/06/05 15:01:02 INFO mapred.JobClient: Map input records=1 > 14/06/05 15:01:02 INFO mapred.JobClient: Physical memory (bytes) > snapshot=122052608 > 14/06/05 15:01:02 INFO mapred.JobClient: Spilled Records=0 > 14/06/05 15:01:02 INFO 
mapred.JobClient: CPU time spent (ms)=1490 > 14/06/05 15:01:02 INFO mapred.JobClient: Total committed heap usage > (bytes)=58195968 > 14/06/05 15:01:02 INFO mapred.JobClient: Virtual memory (bytes) > snapshot=1119281152 > 14/06/05 15:01:02 INFO mapred.JobClient: Map output records=1 > 14/06/05 15:01:02 INFO mapred.JobClient: SPLIT_RAW_BYTES=110 > 14/06/05 15:01:02 INFO crawl.InjectorJob: InjectorJob: total number of > urls rejected by filters: 0 > 14/06/05 15:01:02 INFO crawl.InjectorJob: InjectorJob: total number of > urls injected after normalization and filtering: 1 > 14/06/05 15:01:02 INFO crawl.InjectorJob: Injector: finished at 2014-06-05 > 15:01:02, elapsed: 00:00:28 > Thu Jun 5 15:01:02 EDT 2014 : Iteration 1 of 2 > Generating batchId > Generating a new fetchlist > Warning: $HADOOP_HOME is deprecated. > > 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: starting at > 2014-06-05 15:01:06 > 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: Selecting > best-scoring urls due for fetch. 
> 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: starting > 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: filtering: false > 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: normalizing: false > 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: topN: 50000 > 14/06/05 15:01:06 INFO crawl.FetchScheduleFactory: Using FetchSchedule > impl: org.apache.nutch.crawl.DefaultFetchSchedule > 14/06/05 15:01:06 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000 > 14/06/05 15:01:06 INFO crawl.AbstractFetchSchedule: maxInterval=7776000 > 14/06/05 15:01:07 INFO connection.CassandraHostRetryService: Downed Host > Retry service started with queue size -1 and retry delay 10s > 14/06/05 15:01:11 INFO service.JmxMonitor: Registering JMX > me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector > 14/06/05 15:01:15 INFO mapred.JobClient: Running job: job_201406051410_0012 > 14/06/05 15:01:16 INFO mapred.JobClient: map 0% reduce 0% > 14/06/05 15:01:55 INFO mapred.JobClient: map 100% reduce 0% > 14/06/05 15:02:05 INFO mapred.JobClient: map 100% reduce 33% > 14/06/05 15:02:08 INFO mapred.JobClient: map 100% reduce 66% > 14/06/05 15:02:10 INFO mapred.JobClient: map 100% reduce 83% > 14/06/05 15:02:11 INFO mapred.JobClient: map 100% reduce 100% > 14/06/05 15:02:14 INFO mapred.JobClient: Job complete: > job_201406051410_0012 > 14/06/05 15:02:14 INFO mapred.JobClient: Counters: 27 > 14/06/05 15:02:14 INFO mapred.JobClient: Job Counters > 14/06/05 15:02:14 INFO mapred.JobClient: Launched reduce tasks=2 > 14/06/05 15:02:14 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=39990 > 14/06/05 15:02:14 INFO mapred.JobClient: Total time spent by all > reduces waiting after reserving slots (ms)=0 > 14/06/05 15:02:14 INFO mapred.JobClient: Total time spent by all maps > waiting after reserving slots (ms)=0 > 14/06/05 15:02:14 INFO mapred.JobClient: Launched map tasks=1 > 14/06/05 15:02:14 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=29119 > 14/06/05 
15:02:14 INFO mapred.JobClient: File Output Format Counters > 14/06/05 15:02:14 INFO mapred.JobClient: Bytes Written=0 > 14/06/05 15:02:14 INFO mapred.JobClient: FileSystemCounters > 14/06/05 15:02:14 INFO mapred.JobClient: FILE_BYTES_READ=44 > 14/06/05 15:02:14 INFO mapred.JobClient: HDFS_BYTES_READ=951 > 14/06/05 15:02:14 INFO mapred.JobClient: FILE_BYTES_WRITTEN=239453 > 14/06/05 15:02:14 INFO mapred.JobClient: File Input Format Counters > 14/06/05 15:02:14 INFO mapred.JobClient: Bytes Read=0 > 14/06/05 15:02:14 INFO mapred.JobClient: Map-Reduce Framework > 14/06/05 15:02:14 INFO mapred.JobClient: Map output materialized > bytes=28 > 14/06/05 15:02:14 INFO mapred.JobClient: Map input records=0 > 14/06/05 15:02:14 INFO mapred.JobClient: Reduce shuffle bytes=28 > 14/06/05 15:02:14 INFO mapred.JobClient: Spilled Records=0 > 14/06/05 15:02:14 INFO mapred.JobClient: Map output bytes=0 > 14/06/05 15:02:14 INFO mapred.JobClient: Total committed heap usage > (bytes)=333971456 > 14/06/05 15:02:14 INFO mapred.JobClient: CPU time spent (ms)=9330 > 14/06/05 15:02:14 INFO mapred.JobClient: Combine input records=0 > 14/06/05 15:02:14 INFO mapred.JobClient: SPLIT_RAW_BYTES=951 > 14/06/05 15:02:14 INFO mapred.JobClient: Reduce input records=0 > 14/06/05 15:02:14 INFO mapred.JobClient: Reduce input groups=0 > 14/06/05 15:02:14 INFO mapred.JobClient: Combine output records=0 > 14/06/05 15:02:14 INFO mapred.JobClient: Physical memory (bytes) > snapshot=486813696 > 14/06/05 15:02:14 INFO mapred.JobClient: Reduce output records=0 > 14/06/05 15:02:14 INFO mapred.JobClient: Virtual memory (bytes) > snapshot=6016212992 > 14/06/05 15:02:14 INFO mapred.JobClient: Map output records=0 > 14/06/05 15:02:14 INFO crawl.GeneratorJob: GeneratorJob: finished at > 2014-06-05 15:02:14, time elapsed: 00:01:08 > 14/06/05 15:02:14 INFO crawl.GeneratorJob: GeneratorJob: generated batch > id: 1401994862-29963 > Fetching : > Warning: $HADOOP_HOME is deprecated. 
> > 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: starting > 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: batchId: > 1401994862-29963 > 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: threads: 50 > 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: parsing: false > 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: resuming: false > 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob : timelimit set for > : 1402005738902 > 14/06/05 15:02:19 INFO plugin.PluginRepository: Plugins: looking in: > /app/hadoop/tmp/hadoop-unjar813633856909664022/classes/plugins > 14/06/05 15:02:20 INFO plugin.PluginRepository: Plugin Auto-activation > mode: [true] > 14/06/05 15:02:20 INFO plugin.PluginRepository: Registered Plugins: > 14/06/05 15:02:20 INFO plugin.PluginRepository: the nutch core extension > points (nutch-extensionpoints) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Regex URL Normalizer > (urlnormalizer-regex) > 14/06/05 15:02:20 INFO plugin.PluginRepository: CyberNeko HTML Parser > (lib-nekohtml) > 14/06/05 15:02:20 INFO plugin.PluginRepository: OPIC Scoring Plug-in > (scoring-opic) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Basic URL Normalizer > (urlnormalizer-basic) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Tika Parser Plug-in > (parse-tika) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Basic Indexing Filter > (index-basic) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Html Parse Plug-in > (parse-html) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Anchor Indexing Filter > (index-anchor) > 14/06/05 15:02:20 INFO plugin.PluginRepository: HTTP Framework (lib-http) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Regex URL Filter > (urlfilter-regex) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Regex URL Filter > Framework (lib-regex-filter) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Pass-through URL > Normalizer (urlnormalizer-pass) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Http 
Protocol Plug-in > (protocol-http) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Registered > Extension-Points: > 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch URL Normalizer > (org.apache.nutch.net.URLNormalizer) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Protocol > (org.apache.nutch.protocol.Protocol) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Parse Filter > (org.apache.nutch.parse.ParseFilter) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch URL Filter > (org.apache.nutch.net.URLFilter) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Indexing Filter > (org.apache.nutch.indexer.IndexingFilter) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Content Parser > (org.apache.nutch.parse.Parser) > 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Scoring > (org.apache.nutch.scoring.ScoringFilter) > 14/06/05 15:02:20 INFO http.Http: http.proxy.host = null > 14/06/05 15:02:20 INFO http.Http: http.proxy.port = 8080 > 14/06/05 15:02:20 INFO http.Http: http.timeout = 10000 > 14/06/05 15:02:20 INFO http.Http: http.content.limit = 65536 > 14/06/05 15:02:20 INFO http.Http: http.agent = Qontifi/Nutch-2.2.1 (A big > data analytics and social media intelligence platform; http://qontifi.com; > manikandan at thesocialpeople dot net) > 14/06/05 15:02:20 INFO http.Http: http.accept.language = > en-us,en-gb,en;q=0.7,*;q=0.3 > 14/06/05 15:02:20 INFO http.Http: http.accept = > text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 > 14/06/05 15:02:20 INFO connection.CassandraHostRetryService: Downed Host > Retry service started with queue size -1 and retry delay 10s > 14/06/05 15:02:25 INFO service.JmxMonitor: Registering JMX > me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector > 14/06/05 15:02:29 INFO mapred.JobClient: Running job: job_201406051410_0013 > 14/06/05 15:02:30 INFO mapred.JobClient: map 0% reduce 0% > 14/06/05 15:03:05 INFO mapred.JobClient: map 100% reduce 0% > 14/06/05 15:03:14 
INFO mapred.JobClient: map 100% reduce 16% > 14/06/05 15:03:16 INFO mapred.JobClient: map 100% reduce 33% > 14/06/05 15:03:17 INFO mapred.JobClient: map 100% reduce 50% > 14/06/05 15:03:19 INFO mapred.JobClient: map 100% reduce 66% > 14/06/05 15:03:23 INFO mapred.JobClient: map 100% reduce 83% > 14/06/05 15:03:28 INFO mapred.JobClient: map 100% reduce 100% > 14/06/05 15:03:31 INFO mapred.JobClient: Job complete: > job_201406051410_0013 > 14/06/05 15:03:31 INFO mapred.JobClient: Counters: 28 > 14/06/05 15:03:31 INFO mapred.JobClient: Job Counters > 14/06/05 15:03:31 INFO mapred.JobClient: Launched reduce tasks=2 > 14/06/05 15:03:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=37163 > 14/06/05 15:03:31 INFO mapred.JobClient: Total time spent by all > reduces waiting after reserving slots (ms)=0 > 14/06/05 15:03:31 INFO mapred.JobClient: Total time spent by all maps > waiting after reserving slots (ms)=0 > 14/06/05 15:03:31 INFO mapred.JobClient: Launched map tasks=1 > 14/06/05 15:03:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=39755 > 14/06/05 15:03:31 INFO mapred.JobClient: File Output Format Counters > 14/06/05 15:03:31 INFO mapred.JobClient: Bytes Written=0 > 14/06/05 15:03:31 INFO mapred.JobClient: FileSystemCounters > 14/06/05 15:03:31 INFO mapred.JobClient: FILE_BYTES_READ=44 > 14/06/05 15:03:31 INFO mapred.JobClient: HDFS_BYTES_READ=935 > 14/06/05 15:03:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=237923 > 14/06/05 15:03:31 INFO mapred.JobClient: File Input Format Counters > 14/06/05 15:03:31 INFO mapred.JobClient: Bytes Read=0 > 14/06/05 15:03:31 INFO mapred.JobClient: FetcherStatus > 14/06/05 15:03:31 INFO mapred.JobClient: HitByTimeLimit-QueueFeeder=0 > 14/06/05 15:03:31 INFO mapred.JobClient: Map-Reduce Framework > 14/06/05 15:03:31 INFO mapred.JobClient: Map output materialized > bytes=28 > 14/06/05 15:03:31 INFO mapred.JobClient: Map input records=0 > 14/06/05 15:03:31 INFO mapred.JobClient: Reduce shuffle bytes=28 > 14/06/05 15:03:31 INFO 
mapred.JobClient: Spilled Records=0 > 14/06/05 15:03:31 INFO mapred.JobClient: Map output bytes=0 > 14/06/05 15:03:31 INFO mapred.JobClient: Total committed heap usage > (bytes)=375914496 > 14/06/05 15:03:31 INFO mapred.JobClient: CPU time spent (ms)=9820 > 14/06/05 15:03:31 INFO mapred.JobClient: Combine input records=0 > 14/06/05 15:03:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=935 > 14/06/05 15:03:31 INFO mapred.JobClient: Reduce input records=0 > 14/06/05 15:03:31 INFO mapred.JobClient: Reduce input groups=0 > 14/06/05 15:03:31 INFO mapred.JobClient: Combine output records=0 > 14/06/05 15:03:31 INFO mapred.JobClient: Physical memory (bytes) > snapshot=510382080 > 14/06/05 15:03:31 INFO mapred.JobClient: Reduce output records=0 > 14/06/05 15:03:31 INFO mapred.JobClient: Virtual memory (bytes) > snapshot=6060650496 > 14/06/05 15:03:31 INFO mapred.JobClient: Map output records=0 > 14/06/05 15:03:31 INFO fetcher.FetcherJob: FetcherJob: done > Parsing : > Warning: $HADOOP_HOME is deprecated. 
> > 14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: starting > 14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: resuming: false > 14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: forced reparse: false > 14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: batchId: > 1401994862-29963 > 14/06/05 15:03:35 INFO plugin.PluginRepository: Plugins: looking in: > /app/hadoop/tmp/hadoop-unjar8143815380567453850/classes/plugins > 14/06/05 15:03:36 INFO plugin.PluginRepository: Plugin Auto-activation > mode: [true] > 14/06/05 15:03:36 INFO plugin.PluginRepository: Registered Plugins: > 14/06/05 15:03:36 INFO plugin.PluginRepository: the nutch core extension > points (nutch-extensionpoints) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Regex URL Normalizer > (urlnormalizer-regex) > 14/06/05 15:03:36 INFO plugin.PluginRepository: CyberNeko HTML Parser > (lib-nekohtml) > 14/06/05 15:03:36 INFO plugin.PluginRepository: OPIC Scoring Plug-in > (scoring-opic) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Basic URL Normalizer > (urlnormalizer-basic) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Tika Parser Plug-in > (parse-tika) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Basic Indexing Filter > (index-basic) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Html Parse Plug-in > (parse-html) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Anchor Indexing Filter > (index-anchor) > 14/06/05 15:03:36 INFO plugin.PluginRepository: HTTP Framework (lib-http) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Regex URL Filter > (urlfilter-regex) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Regex URL Filter > Framework (lib-regex-filter) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Pass-through URL > Normalizer (urlnormalizer-pass) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Http Protocol Plug-in > (protocol-http) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Registered > Extension-Points: > 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch URL 
Normalizer > (org.apache.nutch.net.URLNormalizer) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Protocol > (org.apache.nutch.protocol.Protocol) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Parse Filter > (org.apache.nutch.parse.ParseFilter) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch URL Filter > (org.apache.nutch.net.URLFilter) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Indexing Filter > (org.apache.nutch.indexer.IndexingFilter) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Content Parser > (org.apache.nutch.parse.Parser) > 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Scoring > (org.apache.nutch.scoring.ScoringFilter) > 14/06/05 15:03:36 INFO conf.Configuration: found resource > parse-plugins.xml at > file:/app/hadoop/tmp/hadoop-unjar8143815380567453850/parse-plugins.xml > 14/06/05 15:03:36 INFO crawl.SignatureFactory: Using Signature impl: > org.apache.nutch.crawl.MD5Signature > 14/06/05 15:03:37 INFO connection.CassandraHostRetryService: Downed Host > Retry service started with queue size -1 and retry delay 10s > 14/06/05 15:03:41 INFO service.JmxMonitor: Registering JMX > me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector > 14/06/05 15:03:45 INFO mapred.JobClient: Running job: job_201406051410_0014 > 14/06/05 15:03:46 INFO mapred.JobClient: map 0% reduce 0% > 14/06/05 15:04:22 INFO mapred.JobClient: map 100% reduce 0% > 14/06/05 15:04:24 INFO mapred.JobClient: Job complete: > job_201406051410_0014 > 14/06/05 15:04:25 INFO mapred.JobClient: Counters: 17 > 14/06/05 15:04:25 INFO mapred.JobClient: Job Counters > 14/06/05 15:04:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=36653 > 14/06/05 15:04:25 INFO mapred.JobClient: Total time spent by all > reduces waiting after reserving slots (ms)=0 > 14/06/05 15:04:25 INFO mapred.JobClient: Total time spent by all maps > waiting after reserving slots (ms)=0 > 14/06/05 15:04:25 INFO mapred.JobClient: Launched map tasks=1 > 14/06/05 15:04:25 
INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 > 14/06/05 15:04:25 INFO mapred.JobClient: File Output Format Counters > 14/06/05 15:04:25 INFO mapred.JobClient: Bytes Written=0 > 14/06/05 15:04:25 INFO mapred.JobClient: FileSystemCounters > 14/06/05 15:04:25 INFO mapred.JobClient: HDFS_BYTES_READ=979 > 14/06/05 15:04:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=78853 > 14/06/05 15:04:25 INFO mapred.JobClient: File Input Format Counters > 14/06/05 15:04:25 INFO mapred.JobClient: Bytes Read=0 > 14/06/05 15:04:25 INFO mapred.JobClient: Map-Reduce Framework > 14/06/05 15:04:25 INFO mapred.JobClient: Map input records=0 > 14/06/05 15:04:25 INFO mapred.JobClient: Physical memory (bytes) > snapshot=129826816 > 14/06/05 15:04:25 INFO mapred.JobClient: Spilled Records=0 > 14/06/05 15:04:25 INFO mapred.JobClient: CPU time spent (ms)=2330 > 14/06/05 15:04:25 INFO mapred.JobClient: Total committed heap usage > (bytes)=60817408 > 14/06/05 15:04:25 INFO mapred.JobClient: Virtual memory (bytes) > snapshot=2000629760 > 14/06/05 15:04:25 INFO mapred.JobClient: Map output records=0 > 14/06/05 15:04:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=979 > 14/06/05 15:04:25 INFO parse.ParserJob: ParserJob: success > CrawlDB update for TestCrawl > Warning: $HADOOP_HOME is deprecated. 
> > 14/06/05 15:04:28 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting > 14/06/05 15:04:29 INFO plugin.PluginRepository: Plugins: looking in: > /app/hadoop/tmp/hadoop-unjar4238316120015868426/classes/plugins > 14/06/05 15:04:29 INFO plugin.PluginRepository: Plugin Auto-activation > mode: [true] > 14/06/05 15:04:29 INFO plugin.PluginRepository: Registered Plugins: > 14/06/05 15:04:29 INFO plugin.PluginRepository: the nutch core extension > points (nutch-extensionpoints) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Regex URL Normalizer > (urlnormalizer-regex) > 14/06/05 15:04:29 INFO plugin.PluginRepository: CyberNeko HTML Parser > (lib-nekohtml) > 14/06/05 15:04:29 INFO plugin.PluginRepository: OPIC Scoring Plug-in > (scoring-opic) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Basic URL Normalizer > (urlnormalizer-basic) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Tika Parser Plug-in > (parse-tika) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Basic Indexing Filter > (index-basic) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Html Parse Plug-in > (parse-html) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Anchor Indexing Filter > (index-anchor) > 14/06/05 15:04:29 INFO plugin.PluginRepository: HTTP Framework (lib-http) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Regex URL Filter > (urlfilter-regex) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Regex URL Filter > Framework (lib-regex-filter) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Pass-through URL > Normalizer (urlnormalizer-pass) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Http Protocol Plug-in > (protocol-http) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Registered > Extension-Points: > 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch URL Normalizer > (org.apache.nutch.net.URLNormalizer) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Protocol > (org.apache.nutch.protocol.Protocol) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Parse Filter > 
(org.apache.nutch.parse.ParseFilter) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch URL Filter > (org.apache.nutch.net.URLFilter) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Indexing Filter > (org.apache.nutch.indexer.IndexingFilter) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Content Parser > (org.apache.nutch.parse.Parser) > 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Scoring > (org.apache.nutch.scoring.ScoringFilter) > 14/06/05 15:04:30 INFO connection.CassandraHostRetryService: Downed Host > Retry service started with queue size -1 and retry delay 10s > 14/06/05 15:04:34 INFO service.JmxMonitor: Registering JMX > me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector > 14/06/05 15:04:38 INFO mapred.JobClient: Running job: job_201406051410_0015 > 14/06/05 15:04:39 INFO mapred.JobClient: map 0% reduce 0% > 14/06/05 15:05:21 INFO mapred.JobClient: map 100% reduce 0% > 14/06/05 15:05:31 INFO mapred.JobClient: map 100% reduce 33% > 14/06/05 15:05:34 INFO mapred.JobClient: map 100% reduce 66% > 14/06/05 15:05:37 INFO mapred.JobClient: map 100% reduce 100% > 14/06/05 15:05:39 INFO mapred.JobClient: Job complete: > job_201406051410_0015 > 14/06/05 15:05:39 INFO mapred.JobClient: Counters: 27 > 14/06/05 15:05:39 INFO mapred.JobClient: Job Counters > 14/06/05 15:05:39 INFO mapred.JobClient: Launched reduce tasks=2 > 14/06/05 15:05:39 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=39898 > 14/06/05 15:05:39 INFO mapred.JobClient: Total time spent by all > reduces waiting after reserving slots (ms)=0 > 14/06/05 15:05:39 INFO mapred.JobClient: Total time spent by all maps > waiting after reserving slots (ms)=0 > 14/06/05 15:05:39 INFO mapred.JobClient: Launched map tasks=1 > 14/06/05 15:05:39 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=30439 > 14/06/05 15:05:39 INFO mapred.JobClient: File Output Format Counters > 14/06/05 15:05:39 INFO mapred.JobClient: Bytes Written=0 > 14/06/05 15:05:39 INFO mapred.JobClient: 
FileSystemCounters > 14/06/05 15:05:39 INFO mapred.JobClient: FILE_BYTES_READ=44 > 14/06/05 15:05:39 INFO mapred.JobClient: HDFS_BYTES_READ=1028 > 14/06/05 15:05:39 INFO mapred.JobClient: FILE_BYTES_WRITTEN=237914 > 14/06/05 15:05:39 INFO mapred.JobClient: File Input Format Counters > 14/06/05 15:05:39 INFO mapred.JobClient: Bytes Read=0 > 14/06/05 15:05:39 INFO mapred.JobClient: Map-Reduce Framework > 14/06/05 15:05:39 INFO mapred.JobClient: Map output materialized > bytes=28 > 14/06/05 15:05:39 INFO mapred.JobClient: Map input records=0 > 14/06/05 15:05:39 INFO mapred.JobClient: Reduce shuffle bytes=28 > 14/06/05 15:05:39 INFO mapred.JobClient: Spilled Records=0 > 14/06/05 15:05:39 INFO mapred.JobClient: Map output bytes=0 > 14/06/05 15:05:39 INFO mapred.JobClient: Total committed heap usage > (bytes)=375914496 > 14/06/05 15:05:39 INFO mapred.JobClient: CPU time spent (ms)=8880 > 14/06/05 15:05:39 INFO mapred.JobClient: Combine input records=0 > 14/06/05 15:05:39 INFO mapred.JobClient: SPLIT_RAW_BYTES=1028 > 14/06/05 15:05:39 INFO mapred.JobClient: Reduce input records=0 > 14/06/05 15:05:39 INFO mapred.JobClient: Reduce input groups=0 > 14/06/05 15:05:39 INFO mapred.JobClient: Combine output records=0 > 14/06/05 15:05:39 INFO mapred.JobClient: Physical memory (bytes) > snapshot=490651648 > 14/06/05 15:05:39 INFO mapred.JobClient: Reduce output records=0 > 14/06/05 15:05:39 INFO mapred.JobClient: Virtual memory (bytes) > snapshot=6002880512 > 14/06/05 15:05:39 INFO mapred.JobClient: Map output records=0 > 14/06/05 15:05:39 INFO crawl.DbUpdaterJob: DbUpdaterJob: done > Indexing TestCrawl on SOLR index -> http://10.130.231.16:8983/solr/nutch > Warning: $HADOOP_HOME is deprecated. 
> > 14/06/05 15:05:43 INFO solr.SolrIndexerJob: SolrIndexerJob: starting > 14/06/05 15:05:44 INFO plugin.PluginRepository: Plugins: looking in: > /app/hadoop/tmp/hadoop-unjar7543842044056940295/classes/plugins > 14/06/05 15:05:44 INFO plugin.PluginRepository: Plugin Auto-activation > mode: [true] > 14/06/05 15:05:44 INFO plugin.PluginRepository: Registered Plugins: > 14/06/05 15:05:44 INFO plugin.PluginRepository: the nutch core extension > points (nutch-extensionpoints) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Regex URL Normalizer > (urlnormalizer-regex) > 14/06/05 15:05:44 INFO plugin.PluginRepository: CyberNeko HTML Parser > (lib-nekohtml) > 14/06/05 15:05:44 INFO plugin.PluginRepository: OPIC Scoring Plug-in > (scoring-opic) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Basic URL Normalizer > (urlnormalizer-basic) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Tika Parser Plug-in > (parse-tika) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Basic Indexing Filter > (index-basic) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Html Parse Plug-in > (parse-html) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Anchor Indexing Filter > (index-anchor) > 14/06/05 15:05:44 INFO plugin.PluginRepository: HTTP Framework (lib-http) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Regex URL Filter > (urlfilter-regex) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Regex URL Filter > Framework (lib-regex-filter) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Pass-through URL > Normalizer (urlnormalizer-pass) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Http Protocol Plug-in > (protocol-http) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Registered > Extension-Points: > 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch URL Normalizer > (org.apache.nutch.net.URLNormalizer) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Protocol > (org.apache.nutch.protocol.Protocol) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Parse Filter 
> (org.apache.nutch.parse.ParseFilter) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch URL Filter > (org.apache.nutch.net.URLFilter) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Indexing Filter > (org.apache.nutch.indexer.IndexingFilter) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Content Parser > (org.apache.nutch.parse.Parser) > 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Scoring > (org.apache.nutch.scoring.ScoringFilter) > 14/06/05 15:05:44 INFO basic.BasicIndexingFilter: Maximum title length for > indexing set to: 100 > 14/06/05 15:05:44 INFO indexer.IndexingFilters: Adding > org.apache.nutch.indexer.basic.BasicIndexingFilter > 14/06/05 15:05:44 INFO anchor.AnchorIndexingFilter: Anchor deduplication > is: off > 14/06/05 15:05:44 INFO indexer.IndexingFilters: Adding > org.apache.nutch.indexer.anchor.AnchorIndexingFilter > 14/06/05 15:05:45 INFO connection.CassandraHostRetryService: Downed Host > Retry service started with queue size -1 and retry delay 10s > 14/06/05 15:05:49 INFO service.JmxMonitor: Registering JMX > me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector > 14/06/05 15:05:52 INFO mapred.JobClient: Running job: job_201406051410_0016 > 14/06/05 15:05:53 INFO mapred.JobClient: map 0% reduce 0% > 14/06/05 15:06:29 INFO mapred.JobClient: map 100% reduce 0% > 14/06/05 15:06:32 INFO mapred.JobClient: Job complete: > job_201406051410_0016 > 14/06/05 15:06:32 INFO mapred.JobClient: Counters: 17 > 14/06/05 15:06:32 INFO mapred.JobClient: Job Counters > 14/06/05 15:06:32 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=36879 > 14/06/05 15:06:32 INFO mapred.JobClient: Total time spent by all > reduces waiting after reserving slots (ms)=0 > 14/06/05 15:06:32 INFO mapred.JobClient: Total time spent by all maps > waiting after reserving slots (ms)=0 > 14/06/05 15:06:32 INFO mapred.JobClient: Launched map tasks=1 > 14/06/05 15:06:32 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 > 14/06/05 15:06:32 INFO 
mapred.JobClient: File Output Format Counters
> 14/06/05 15:06:32 INFO mapred.JobClient: Bytes Written=0
> 14/06/05 15:06:32 INFO mapred.JobClient: FileSystemCounters
> 14/06/05 15:06:32 INFO mapred.JobClient: HDFS_BYTES_READ=962
> 14/06/05 15:06:32 INFO mapred.JobClient: FILE_BYTES_WRITTEN=78923
> 14/06/05 15:06:32 INFO mapred.JobClient: File Input Format Counters
> 14/06/05 15:06:32 INFO mapred.JobClient: Bytes Read=0
> 14/06/05 15:06:32 INFO mapred.JobClient: Map-Reduce Framework
> 14/06/05 15:06:32 INFO mapred.JobClient: Map input records=0
> 14/06/05 15:06:32 INFO mapred.JobClient: Physical memory (bytes) snapshot=114335744
> 14/06/05 15:06:32 INFO mapred.JobClient: Spilled Records=0
> 14/06/05 15:06:32 INFO mapred.JobClient: CPU time spent (ms)=2670
> 14/06/05 15:06:32 INFO mapred.JobClient: Total committed heap usage (bytes)=60293120
> 14/06/05 15:06:32 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1990189056
> 14/06/05 15:06:32 INFO mapred.JobClient: Map output records=0
> 14/06/05 15:06:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=962
> 14/06/05 15:06:32 INFO solr.SolrIndexerJob: SolrIndexerJob: done.
>
> When I run readdb -stats, I get:
>
> hduser@nutch-one-qontifi:/usr/local/nutch$ bin/nutch readdb TestCrawl -stats
> Warning: $HADOOP_HOME is deprecated.
>
> 14/06/05 15:13:19 INFO crawl.WebTableReader: WebTable statistics start
> 14/06/05 15:13:21 INFO connection.CassandraHostRetryService: Downed Host
> Retry service started with queue size -1 and retry delay 10s
> 14/06/05 15:13:25 INFO service.JmxMonitor: Registering JMX
> me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
> 14/06/05 15:13:29 INFO mapred.JobClient: Running job: job_201406051410_0019
> 14/06/05 15:13:30 INFO mapred.JobClient: map 0% reduce 0%
> 14/06/05 15:14:06 INFO mapred.JobClient: map 100% reduce 0%
> 14/06/05 15:14:15 INFO mapred.JobClient: map 100% reduce 33%
> 14/06/05 15:14:17 INFO mapred.JobClient: map 100% reduce 100%
> 14/06/05 15:14:19 INFO mapred.JobClient: Job complete:
> job_201406051410_0019
> 14/06/05 15:14:19 INFO mapred.JobClient: Counters: 28
> 14/06/05 15:14:19 INFO mapred.JobClient: Job Counters
> 14/06/05 15:14:19 INFO mapred.JobClient: Launched reduce tasks=1
> 14/06/05 15:14:19 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=36697
> 14/06/05 15:14:19 INFO mapred.JobClient: Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 14/06/05 15:14:19 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
> 14/06/05 15:14:19 INFO mapred.JobClient: Launched map tasks=1
> 14/06/05 15:14:19 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10302
> 14/06/05 15:14:19 INFO mapred.JobClient: File Output Format Counters
> 14/06/05 15:14:19 INFO mapred.JobClient: Bytes Written=86
> 14/06/05 15:14:19 INFO mapred.JobClient: FileSystemCounters
> 14/06/05 15:14:19 INFO mapred.JobClient: FILE_BYTES_READ=6
> 14/06/05 15:14:19 INFO mapred.JobClient: HDFS_BYTES_READ=1135
> 14/06/05 15:14:19 INFO mapred.JobClient: FILE_BYTES_WRITTEN=157112
> 14/06/05 15:14:19 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=86
> 14/06/05 15:14:19 INFO mapred.JobClient: File Input Format Counters
> 14/06/05 15:14:19 INFO mapred.JobClient: Bytes Read=0
> 14/06/05 15:14:19 INFO mapred.JobClient: Map-Reduce Framework
> 14/06/05 15:14:19 INFO mapred.JobClient: Map output materialized
> bytes=6
> 14/06/05 15:14:19 INFO mapred.JobClient: Map input records=0
> 14/06/05 15:14:19 INFO mapred.JobClient: Reduce shuffle bytes=6
> 14/06/05 15:14:19 INFO mapred.JobClient: Spilled Records=0
> 14/06/05 15:14:19 INFO mapred.JobClient: Map output bytes=0
> 14/06/05 15:14:19 INFO mapred.JobClient: Total committed heap usage
> (bytes)=216530944
> 14/06/05 15:14:19 INFO mapred.JobClient: CPU time spent (ms)=2450
> 14/06/05 15:14:19 INFO mapred.JobClient: Combine input records=0
> 14/06/05 15:14:19 INFO mapred.JobClient: SPLIT_RAW_BYTES=1135
> 14/06/05 15:14:19 INFO mapred.JobClient: Reduce input records=0
> 14/06/05 15:14:19 INFO mapred.JobClient: Reduce input groups=0
> 14/06/05 15:14:19 INFO mapred.JobClient: Combine output records=0
> 14/06/05 15:14:19 INFO mapred.JobClient: Physical memory (bytes)
> snapshot=320630784
> 14/06/05 15:14:19 INFO mapred.JobClient: Reduce output records=0
> 14/06/05 15:14:19 INFO mapred.JobClient: Virtual memory (bytes)
> snapshot=2254024704
> 14/06/05 15:14:19 INFO mapred.JobClient: Map output records=0
> 14/06/05 15:14:19 INFO crawl.WebTableReader: Statistics for WebTable:
> 14/06/05 15:14:19 INFO crawl.WebTableReader: jobs:
> {db_stats-job_201406051410_0019={jobID=job_201406051410_0019,
> jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Job
> Counters ={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=36697,
> FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0,
> TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=10302}, Map-Reduce
> Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0,
> REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0,
> COMMITTED_HEAP_BYTES=216530944, CPU_MILLISECONDS=2450,
> SPLIT_RAW_BYTES=1135, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0,
> REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0,
> PHYSICAL_MEMORY_BYTES=320630784, REDUCE_OUTPUT_RECORDS=0,
> VIRTUAL_MEMORY_BYTES=2254024704, MAP_OUTPUT_RECORDS=0},
> FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1135,
> FILE_BYTES_WRITTEN=157112, HDFS_BYTES_WRITTEN=86}, File Output Format
> Counters ={BYTES_WRITTEN=86}}}}
> 14/06/05 15:14:19 INFO crawl.WebTableReader: TOTAL urls: 0
> 14/06/05 15:14:19 INFO crawl.WebTableReader: WebTable statistics: done
> 14/06/05 15:14:19 INFO crawl.WebTableReader: jobs:
> {db_stats-job_201406051410_0019={jobID=job_201406051410_0019,
> jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Job
> Counters ={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=36697,
> FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0,
> TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=10302}, Map-Reduce
> Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0,
> REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0,
> COMMITTED_HEAP_BYTES=216530944, CPU_MILLISECONDS=2450,
> SPLIT_RAW_BYTES=1135, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0,
> REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0,
> PHYSICAL_MEMORY_BYTES=320630784, REDUCE_OUTPUT_RECORDS=0,
> VIRTUAL_MEMORY_BYTES=2254024704, MAP_OUTPUT_RECORDS=0},
> FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1135,
> FILE_BYTES_WRITTEN=157112, HDFS_BYTES_WRITTEN=86}, File Output Format
> Counters ={BYTES_WRITTEN=86}}}}
> 14/06/05 15:14:19 INFO crawl.WebTableReader: TOTAL urls: 0
>
> --
> Manikandan Saravanan
> Architect - Technology
> TheSocialPeople <http://thesocialpeople.net>
>

--
*Lewis*

