I built it from Nutch 2.2.1 (src-tar.gz).

--
Manikandan Saravanan
Architect - Technology
TheSocialPeople

On 6 June 2014 at 1:03:18 am, Lewis John Mcgibbney ([email protected]) wrote:

Which version of Nutch are you using? Nutch 2 what?

On Thu, Jun 5, 2014 at 12:14 PM, Manikandan Saravanan <[email protected]> wrote:

Dear Lewis,

I’m running Nutch 2 on a Hadoop 1.2.1 cluster (2 nodes), with Cassandra as my backend datastore. I’m trying to crawl a single link for now. The inject command works properly: I can see one row added to the “webpage” keyspace in Cassandra. But the generator doesn’t do a thing, and neither does the fetcher. In the end, nothing is indexed in Solr. Please help me out.

The console output from the crawl is:

hduser@nutch-one-qontifi:/usr/local/nutch$ bin/crawl urls/seed.txt TestCrawl http://10.130.231.16:8983/solr/nutch 2
Warning: $HADOOP_HOME is deprecated.
14/06/05 15:00:34 INFO crawl.InjectorJob: InjectorJob: starting at 2014-06-05 15:00:34
14/06/05 15:00:34 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls/seed.txt
14/06/05 15:00:36 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:00:40 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:00:41 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as the Gora storage class.
14/06/05 15:00:44 INFO input.FileInputFormat: Total input paths to process : 1 14/06/05 15:00:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library 14/06/05 15:00:44 WARN snappy.LoadSnappy: Snappy native library not loaded 14/06/05 15:00:44 INFO mapred.JobClient: Running job: job_201406051410_0011 14/06/05 15:00:45 INFO mapred.JobClient: map 0% reduce 0% 14/06/05 15:01:00 INFO mapred.JobClient: map 100% reduce 0% 14/06/05 15:01:02 INFO mapred.JobClient: Job complete: job_201406051410_0011 14/06/05 15:01:02 INFO mapred.JobClient: Counters: 19 14/06/05 15:01:02 INFO mapred.JobClient: Job Counters 14/06/05 15:01:02 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=14861 14/06/05 15:01:02 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/06/05 15:01:02 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/06/05 15:01:02 INFO mapred.JobClient: Launched map tasks=1 14/06/05 15:01:02 INFO mapred.JobClient: Data-local map tasks=1 14/06/05 15:01:02 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 14/06/05 15:01:02 INFO mapred.JobClient: File Output Format Counters 14/06/05 15:01:02 INFO mapred.JobClient: Bytes Written=0 14/06/05 15:01:02 INFO mapred.JobClient: injector 14/06/05 15:01:02 INFO mapred.JobClient: urls_injected=1 14/06/05 15:01:02 INFO mapred.JobClient: FileSystemCounters 14/06/05 15:01:02 INFO mapred.JobClient: HDFS_BYTES_READ=135 14/06/05 15:01:02 INFO mapred.JobClient: FILE_BYTES_WRITTEN=77648 14/06/05 15:01:02 INFO mapred.JobClient: File Input Format Counters 14/06/05 15:01:02 INFO mapred.JobClient: Bytes Read=25 14/06/05 15:01:02 INFO mapred.JobClient: Map-Reduce Framework 14/06/05 15:01:02 INFO mapred.JobClient: Map input records=1 14/06/05 15:01:02 INFO mapred.JobClient: Physical memory (bytes) snapshot=122052608 14/06/05 15:01:02 INFO mapred.JobClient: Spilled Records=0 14/06/05 15:01:02 INFO mapred.JobClient: CPU time spent (ms)=1490 14/06/05 15:01:02 INFO mapred.JobClient: 
Total committed heap usage (bytes)=58195968 14/06/05 15:01:02 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1119281152 14/06/05 15:01:02 INFO mapred.JobClient: Map output records=1 14/06/05 15:01:02 INFO mapred.JobClient: SPLIT_RAW_BYTES=110 14/06/05 15:01:02 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 0 14/06/05 15:01:02 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 1 14/06/05 15:01:02 INFO crawl.InjectorJob: Injector: finished at 2014-06-05 15:01:02, elapsed: 00:00:28 Thu Jun 5 15:01:02 EDT 2014 : Iteration 1 of 2 Generating batchId Generating a new fetchlist Warning: $HADOOP_HOME is deprecated. 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: starting at 2014-06-05 15:01:06 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: Selecting best-scoring urls due for fetch. 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: starting 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: filtering: false 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: normalizing: false 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: topN: 50000 14/06/05 15:01:06 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule 14/06/05 15:01:06 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000 14/06/05 15:01:06 INFO crawl.AbstractFetchSchedule: maxInterval=7776000 14/06/05 15:01:07 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s 14/06/05 15:01:11 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector 14/06/05 15:01:15 INFO mapred.JobClient: Running job: job_201406051410_0012 14/06/05 15:01:16 INFO mapred.JobClient: map 0% reduce 0% 14/06/05 15:01:55 INFO mapred.JobClient: map 100% reduce 0% 14/06/05 15:02:05 INFO mapred.JobClient: map 100% reduce 33% 14/06/05 15:02:08 INFO 
mapred.JobClient: map 100% reduce 66% 14/06/05 15:02:10 INFO mapred.JobClient: map 100% reduce 83% 14/06/05 15:02:11 INFO mapred.JobClient: map 100% reduce 100% 14/06/05 15:02:14 INFO mapred.JobClient: Job complete: job_201406051410_0012 14/06/05 15:02:14 INFO mapred.JobClient: Counters: 27 14/06/05 15:02:14 INFO mapred.JobClient: Job Counters 14/06/05 15:02:14 INFO mapred.JobClient: Launched reduce tasks=2 14/06/05 15:02:14 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=39990 14/06/05 15:02:14 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/06/05 15:02:14 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/06/05 15:02:14 INFO mapred.JobClient: Launched map tasks=1 14/06/05 15:02:14 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=29119 14/06/05 15:02:14 INFO mapred.JobClient: File Output Format Counters 14/06/05 15:02:14 INFO mapred.JobClient: Bytes Written=0 14/06/05 15:02:14 INFO mapred.JobClient: FileSystemCounters 14/06/05 15:02:14 INFO mapred.JobClient: FILE_BYTES_READ=44 14/06/05 15:02:14 INFO mapred.JobClient: HDFS_BYTES_READ=951 14/06/05 15:02:14 INFO mapred.JobClient: FILE_BYTES_WRITTEN=239453 14/06/05 15:02:14 INFO mapred.JobClient: File Input Format Counters 14/06/05 15:02:14 INFO mapred.JobClient: Bytes Read=0 14/06/05 15:02:14 INFO mapred.JobClient: Map-Reduce Framework 14/06/05 15:02:14 INFO mapred.JobClient: Map output materialized bytes=28 14/06/05 15:02:14 INFO mapred.JobClient: Map input records=0 14/06/05 15:02:14 INFO mapred.JobClient: Reduce shuffle bytes=28 14/06/05 15:02:14 INFO mapred.JobClient: Spilled Records=0 14/06/05 15:02:14 INFO mapred.JobClient: Map output bytes=0 14/06/05 15:02:14 INFO mapred.JobClient: Total committed heap usage (bytes)=333971456 14/06/05 15:02:14 INFO mapred.JobClient: CPU time spent (ms)=9330 14/06/05 15:02:14 INFO mapred.JobClient: Combine input records=0 14/06/05 15:02:14 INFO mapred.JobClient: SPLIT_RAW_BYTES=951 14/06/05 15:02:14 
INFO mapred.JobClient: Reduce input records=0 14/06/05 15:02:14 INFO mapred.JobClient: Reduce input groups=0 14/06/05 15:02:14 INFO mapred.JobClient: Combine output records=0 14/06/05 15:02:14 INFO mapred.JobClient: Physical memory (bytes) snapshot=486813696 14/06/05 15:02:14 INFO mapred.JobClient: Reduce output records=0 14/06/05 15:02:14 INFO mapred.JobClient: Virtual memory (bytes) snapshot=6016212992 14/06/05 15:02:14 INFO mapred.JobClient: Map output records=0 14/06/05 15:02:14 INFO crawl.GeneratorJob: GeneratorJob: finished at 2014-06-05 15:02:14, time elapsed: 00:01:08 14/06/05 15:02:14 INFO crawl.GeneratorJob: GeneratorJob: generated batch id: 1401994862-29963 Fetching : Warning: $HADOOP_HOME is deprecated. 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: starting 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: batchId: 1401994862-29963 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: threads: 50 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: parsing: false 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: resuming: false 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob : timelimit set for : 1402005738902 14/06/05 15:02:19 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar813633856909664022/classes/plugins 14/06/05 15:02:20 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true] 14/06/05 15:02:20 INFO plugin.PluginRepository: Registered Plugins: 14/06/05 15:02:20 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints) 14/06/05 15:02:20 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex) 14/06/05 15:02:20 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml) 14/06/05 15:02:20 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic) 14/06/05 15:02:20 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic) 14/06/05 15:02:20 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika) 14/06/05 
15:02:20 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic) 14/06/05 15:02:20 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html) 14/06/05 15:02:20 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor) 14/06/05 15:02:20 INFO plugin.PluginRepository: HTTP Framework (lib-http) 14/06/05 15:02:20 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex) 14/06/05 15:02:20 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter) 14/06/05 15:02:20 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass) 14/06/05 15:02:20 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http) 14/06/05 15:02:20 INFO plugin.PluginRepository: Registered Extension-Points: 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol) 14/06/05 15:02:20 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter) 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter) 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser) 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter) 14/06/05 15:02:20 INFO http.Http: http.proxy.host = null 14/06/05 15:02:20 INFO http.Http: http.proxy.port = 8080 14/06/05 15:02:20 INFO http.Http: http.timeout = 10000 14/06/05 15:02:20 INFO http.Http: http.content.limit = 65536 14/06/05 15:02:20 INFO http.Http: http.agent = Qontifi/Nutch-2.2.1 (A big data analytics and social media intelligence platform; http://qontifi.com; manikandan at thesocialpeople dot net) 14/06/05 15:02:20 INFO http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3 14/06/05 15:02:20 INFO 
http.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 14/06/05 15:02:20 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s 14/06/05 15:02:25 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector 14/06/05 15:02:29 INFO mapred.JobClient: Running job: job_201406051410_0013 14/06/05 15:02:30 INFO mapred.JobClient: map 0% reduce 0% 14/06/05 15:03:05 INFO mapred.JobClient: map 100% reduce 0% 14/06/05 15:03:14 INFO mapred.JobClient: map 100% reduce 16% 14/06/05 15:03:16 INFO mapred.JobClient: map 100% reduce 33% 14/06/05 15:03:17 INFO mapred.JobClient: map 100% reduce 50% 14/06/05 15:03:19 INFO mapred.JobClient: map 100% reduce 66% 14/06/05 15:03:23 INFO mapred.JobClient: map 100% reduce 83% 14/06/05 15:03:28 INFO mapred.JobClient: map 100% reduce 100% 14/06/05 15:03:31 INFO mapred.JobClient: Job complete: job_201406051410_0013 14/06/05 15:03:31 INFO mapred.JobClient: Counters: 28 14/06/05 15:03:31 INFO mapred.JobClient: Job Counters 14/06/05 15:03:31 INFO mapred.JobClient: Launched reduce tasks=2 14/06/05 15:03:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=37163 14/06/05 15:03:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/06/05 15:03:31 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/06/05 15:03:31 INFO mapred.JobClient: Launched map tasks=1 14/06/05 15:03:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=39755 14/06/05 15:03:31 INFO mapred.JobClient: File Output Format Counters 14/06/05 15:03:31 INFO mapred.JobClient: Bytes Written=0 14/06/05 15:03:31 INFO mapred.JobClient: FileSystemCounters 14/06/05 15:03:31 INFO mapred.JobClient: FILE_BYTES_READ=44 14/06/05 15:03:31 INFO mapred.JobClient: HDFS_BYTES_READ=935 14/06/05 15:03:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=237923 14/06/05 15:03:31 INFO mapred.JobClient: 
File Input Format Counters 14/06/05 15:03:31 INFO mapred.JobClient: Bytes Read=0 14/06/05 15:03:31 INFO mapred.JobClient: FetcherStatus 14/06/05 15:03:31 INFO mapred.JobClient: HitByTimeLimit-QueueFeeder=0 14/06/05 15:03:31 INFO mapred.JobClient: Map-Reduce Framework 14/06/05 15:03:31 INFO mapred.JobClient: Map output materialized bytes=28 14/06/05 15:03:31 INFO mapred.JobClient: Map input records=0 14/06/05 15:03:31 INFO mapred.JobClient: Reduce shuffle bytes=28 14/06/05 15:03:31 INFO mapred.JobClient: Spilled Records=0 14/06/05 15:03:31 INFO mapred.JobClient: Map output bytes=0 14/06/05 15:03:31 INFO mapred.JobClient: Total committed heap usage (bytes)=375914496 14/06/05 15:03:31 INFO mapred.JobClient: CPU time spent (ms)=9820 14/06/05 15:03:31 INFO mapred.JobClient: Combine input records=0 14/06/05 15:03:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=935 14/06/05 15:03:31 INFO mapred.JobClient: Reduce input records=0 14/06/05 15:03:31 INFO mapred.JobClient: Reduce input groups=0 14/06/05 15:03:31 INFO mapred.JobClient: Combine output records=0 14/06/05 15:03:31 INFO mapred.JobClient: Physical memory (bytes) snapshot=510382080 14/06/05 15:03:31 INFO mapred.JobClient: Reduce output records=0 14/06/05 15:03:31 INFO mapred.JobClient: Virtual memory (bytes) snapshot=6060650496 14/06/05 15:03:31 INFO mapred.JobClient: Map output records=0 14/06/05 15:03:31 INFO fetcher.FetcherJob: FetcherJob: done Parsing : Warning: $HADOOP_HOME is deprecated. 
14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: starting 14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: resuming: false 14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: forced reparse: false 14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: batchId: 1401994862-29963 14/06/05 15:03:35 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar8143815380567453850/classes/plugins 14/06/05 15:03:36 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true] 14/06/05 15:03:36 INFO plugin.PluginRepository: Registered Plugins: 14/06/05 15:03:36 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints) 14/06/05 15:03:36 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex) 14/06/05 15:03:36 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml) 14/06/05 15:03:36 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic) 14/06/05 15:03:36 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic) 14/06/05 15:03:36 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika) 14/06/05 15:03:36 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic) 14/06/05 15:03:36 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html) 14/06/05 15:03:36 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor) 14/06/05 15:03:36 INFO plugin.PluginRepository: HTTP Framework (lib-http) 14/06/05 15:03:36 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex) 14/06/05 15:03:36 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter) 14/06/05 15:03:36 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass) 14/06/05 15:03:36 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http) 14/06/05 15:03:36 INFO plugin.PluginRepository: Registered Extension-Points: 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 14/06/05 15:03:36 INFO 
plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol) 14/06/05 15:03:36 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter) 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter) 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser) 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter) 14/06/05 15:03:36 INFO conf.Configuration: found resource parse-plugins.xml at file:/app/hadoop/tmp/hadoop-unjar8143815380567453850/parse-plugins.xml 14/06/05 15:03:36 INFO crawl.SignatureFactory: Using Signature impl: org.apache.nutch.crawl.MD5Signature 14/06/05 15:03:37 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s 14/06/05 15:03:41 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector 14/06/05 15:03:45 INFO mapred.JobClient: Running job: job_201406051410_0014 14/06/05 15:03:46 INFO mapred.JobClient: map 0% reduce 0% 14/06/05 15:04:22 INFO mapred.JobClient: map 100% reduce 0% 14/06/05 15:04:24 INFO mapred.JobClient: Job complete: job_201406051410_0014 14/06/05 15:04:25 INFO mapred.JobClient: Counters: 17 14/06/05 15:04:25 INFO mapred.JobClient: Job Counters 14/06/05 15:04:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=36653 14/06/05 15:04:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/06/05 15:04:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/06/05 15:04:25 INFO mapred.JobClient: Launched map tasks=1 14/06/05 15:04:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 14/06/05 15:04:25 INFO mapred.JobClient: File Output Format Counters 14/06/05 15:04:25 INFO 
mapred.JobClient: Bytes Written=0 14/06/05 15:04:25 INFO mapred.JobClient: FileSystemCounters 14/06/05 15:04:25 INFO mapred.JobClient: HDFS_BYTES_READ=979 14/06/05 15:04:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=78853 14/06/05 15:04:25 INFO mapred.JobClient: File Input Format Counters 14/06/05 15:04:25 INFO mapred.JobClient: Bytes Read=0 14/06/05 15:04:25 INFO mapred.JobClient: Map-Reduce Framework 14/06/05 15:04:25 INFO mapred.JobClient: Map input records=0 14/06/05 15:04:25 INFO mapred.JobClient: Physical memory (bytes) snapshot=129826816 14/06/05 15:04:25 INFO mapred.JobClient: Spilled Records=0 14/06/05 15:04:25 INFO mapred.JobClient: CPU time spent (ms)=2330 14/06/05 15:04:25 INFO mapred.JobClient: Total committed heap usage (bytes)=60817408 14/06/05 15:04:25 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2000629760 14/06/05 15:04:25 INFO mapred.JobClient: Map output records=0 14/06/05 15:04:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=979 14/06/05 15:04:25 INFO parse.ParserJob: ParserJob: success CrawlDB update for TestCrawl Warning: $HADOOP_HOME is deprecated. 
14/06/05 15:04:28 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting 14/06/05 15:04:29 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar4238316120015868426/classes/plugins 14/06/05 15:04:29 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true] 14/06/05 15:04:29 INFO plugin.PluginRepository: Registered Plugins: 14/06/05 15:04:29 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints) 14/06/05 15:04:29 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex) 14/06/05 15:04:29 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml) 14/06/05 15:04:29 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic) 14/06/05 15:04:29 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic) 14/06/05 15:04:29 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika) 14/06/05 15:04:29 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic) 14/06/05 15:04:29 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html) 14/06/05 15:04:29 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor) 14/06/05 15:04:29 INFO plugin.PluginRepository: HTTP Framework (lib-http) 14/06/05 15:04:29 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex) 14/06/05 15:04:29 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter) 14/06/05 15:04:29 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass) 14/06/05 15:04:29 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http) 14/06/05 15:04:29 INFO plugin.PluginRepository: Registered Extension-Points: 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol) 14/06/05 15:04:29 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter) 14/06/05 15:04:29 INFO plugin.PluginRepository: 
Nutch URL Filter (org.apache.nutch.net.URLFilter) 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser) 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter) 14/06/05 15:04:30 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s 14/06/05 15:04:34 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector 14/06/05 15:04:38 INFO mapred.JobClient: Running job: job_201406051410_0015 14/06/05 15:04:39 INFO mapred.JobClient: map 0% reduce 0% 14/06/05 15:05:21 INFO mapred.JobClient: map 100% reduce 0% 14/06/05 15:05:31 INFO mapred.JobClient: map 100% reduce 33% 14/06/05 15:05:34 INFO mapred.JobClient: map 100% reduce 66% 14/06/05 15:05:37 INFO mapred.JobClient: map 100% reduce 100% 14/06/05 15:05:39 INFO mapred.JobClient: Job complete: job_201406051410_0015 14/06/05 15:05:39 INFO mapred.JobClient: Counters: 27 14/06/05 15:05:39 INFO mapred.JobClient: Job Counters 14/06/05 15:05:39 INFO mapred.JobClient: Launched reduce tasks=2 14/06/05 15:05:39 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=39898 14/06/05 15:05:39 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/06/05 15:05:39 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/06/05 15:05:39 INFO mapred.JobClient: Launched map tasks=1 14/06/05 15:05:39 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=30439 14/06/05 15:05:39 INFO mapred.JobClient: File Output Format Counters 14/06/05 15:05:39 INFO mapred.JobClient: Bytes Written=0 14/06/05 15:05:39 INFO mapred.JobClient: FileSystemCounters 14/06/05 15:05:39 INFO mapred.JobClient: FILE_BYTES_READ=44 14/06/05 15:05:39 INFO mapred.JobClient: HDFS_BYTES_READ=1028 14/06/05 
15:05:39 INFO mapred.JobClient: FILE_BYTES_WRITTEN=237914 14/06/05 15:05:39 INFO mapred.JobClient: File Input Format Counters 14/06/05 15:05:39 INFO mapred.JobClient: Bytes Read=0 14/06/05 15:05:39 INFO mapred.JobClient: Map-Reduce Framework 14/06/05 15:05:39 INFO mapred.JobClient: Map output materialized bytes=28 14/06/05 15:05:39 INFO mapred.JobClient: Map input records=0 14/06/05 15:05:39 INFO mapred.JobClient: Reduce shuffle bytes=28 14/06/05 15:05:39 INFO mapred.JobClient: Spilled Records=0 14/06/05 15:05:39 INFO mapred.JobClient: Map output bytes=0 14/06/05 15:05:39 INFO mapred.JobClient: Total committed heap usage (bytes)=375914496 14/06/05 15:05:39 INFO mapred.JobClient: CPU time spent (ms)=8880 14/06/05 15:05:39 INFO mapred.JobClient: Combine input records=0 14/06/05 15:05:39 INFO mapred.JobClient: SPLIT_RAW_BYTES=1028 14/06/05 15:05:39 INFO mapred.JobClient: Reduce input records=0 14/06/05 15:05:39 INFO mapred.JobClient: Reduce input groups=0 14/06/05 15:05:39 INFO mapred.JobClient: Combine output records=0 14/06/05 15:05:39 INFO mapred.JobClient: Physical memory (bytes) snapshot=490651648 14/06/05 15:05:39 INFO mapred.JobClient: Reduce output records=0 14/06/05 15:05:39 INFO mapred.JobClient: Virtual memory (bytes) snapshot=6002880512 14/06/05 15:05:39 INFO mapred.JobClient: Map output records=0 14/06/05 15:05:39 INFO crawl.DbUpdaterJob: DbUpdaterJob: done Indexing TestCrawl on SOLR index -> http://10.130.231.16:8983/solr/nutch Warning: $HADOOP_HOME is deprecated. 
14/06/05 15:05:43 INFO solr.SolrIndexerJob: SolrIndexerJob: starting 14/06/05 15:05:44 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar7543842044056940295/classes/plugins 14/06/05 15:05:44 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true] 14/06/05 15:05:44 INFO plugin.PluginRepository: Registered Plugins: 14/06/05 15:05:44 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints) 14/06/05 15:05:44 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex) 14/06/05 15:05:44 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml) 14/06/05 15:05:44 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic) 14/06/05 15:05:44 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic) 14/06/05 15:05:44 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika) 14/06/05 15:05:44 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic) 14/06/05 15:05:44 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html) 14/06/05 15:05:44 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor) 14/06/05 15:05:44 INFO plugin.PluginRepository: HTTP Framework (lib-http) 14/06/05 15:05:44 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex) 14/06/05 15:05:44 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter) 14/06/05 15:05:44 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass) 14/06/05 15:05:44 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http) 14/06/05 15:05:44 INFO plugin.PluginRepository: Registered Extension-Points: 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer) 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol) 14/06/05 15:05:44 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter) 14/06/05 15:05:44 INFO 
plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter) 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser) 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter) 14/06/05 15:05:44 INFO basic.BasicIndexingFilter: Maximum title length for indexing set to: 100 14/06/05 15:05:44 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.basic.BasicIndexingFilter 14/06/05 15:05:44 INFO anchor.AnchorIndexingFilter: Anchor deduplication is: off 14/06/05 15:05:44 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter 14/06/05 15:05:45 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s 14/06/05 15:05:49 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector 14/06/05 15:05:52 INFO mapred.JobClient: Running job: job_201406051410_0016 14/06/05 15:05:53 INFO mapred.JobClient: map 0% reduce 0% 14/06/05 15:06:29 INFO mapred.JobClient: map 100% reduce 0% 14/06/05 15:06:32 INFO mapred.JobClient: Job complete: job_201406051410_0016 14/06/05 15:06:32 INFO mapred.JobClient: Counters: 17 14/06/05 15:06:32 INFO mapred.JobClient: Job Counters 14/06/05 15:06:32 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=36879 14/06/05 15:06:32 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/06/05 15:06:32 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/06/05 15:06:32 INFO mapred.JobClient: Launched map tasks=1 14/06/05 15:06:32 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 14/06/05 15:06:32 INFO mapred.JobClient: File Output Format Counters 14/06/05 15:06:32 INFO mapred.JobClient: Bytes Written=0 14/06/05 15:06:32 INFO 
mapred.JobClient: FileSystemCounters 14/06/05 15:06:32 INFO mapred.JobClient: HDFS_BYTES_READ=962 14/06/05 15:06:32 INFO mapred.JobClient: FILE_BYTES_WRITTEN=78923 14/06/05 15:06:32 INFO mapred.JobClient: File Input Format Counters 14/06/05 15:06:32 INFO mapred.JobClient: Bytes Read=0 14/06/05 15:06:32 INFO mapred.JobClient: Map-Reduce Framework 14/06/05 15:06:32 INFO mapred.JobClient: Map input records=0 14/06/05 15:06:32 INFO mapred.JobClient: Physical memory (bytes) snapshot=114335744 14/06/05 15:06:32 INFO mapred.JobClient: Spilled Records=0 14/06/05 15:06:32 INFO mapred.JobClient: CPU time spent (ms)=2670 14/06/05 15:06:32 INFO mapred.JobClient: Total committed heap usage (bytes)=60293120 14/06/05 15:06:32 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1990189056 14/06/05 15:06:32 INFO mapred.JobClient: Map output records=0 14/06/05 15:06:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=962 14/06/05 15:06:32 INFO solr.SolrIndexerJob: SolrIndexerJob: done. When I run readdb -stats, I get: hduser@nutch-one-qontifi:/usr/local/nutch$ bin/nutch readdb TestCrawl -stats Warning: $HADOOP_HOME is deprecated. 
14/06/05 15:13:19 INFO crawl.WebTableReader: WebTable statistics start 14/06/05 15:13:21 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s 14/06/05 15:13:25 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector 14/06/05 15:13:29 INFO mapred.JobClient: Running job: job_201406051410_0019 14/06/05 15:13:30 INFO mapred.JobClient: map 0% reduce 0% 14/06/05 15:14:06 INFO mapred.JobClient: map 100% reduce 0% 14/06/05 15:14:15 INFO mapred.JobClient: map 100% reduce 33% 14/06/05 15:14:17 INFO mapred.JobClient: map 100% reduce 100% 14/06/05 15:14:19 INFO mapred.JobClient: Job complete: job_201406051410_0019 14/06/05 15:14:19 INFO mapred.JobClient: Counters: 28 14/06/05 15:14:19 INFO mapred.JobClient: Job Counters 14/06/05 15:14:19 INFO mapred.JobClient: Launched reduce tasks=1 14/06/05 15:14:19 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=36697 14/06/05 15:14:19 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 14/06/05 15:14:19 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 14/06/05 15:14:19 INFO mapred.JobClient: Launched map tasks=1 14/06/05 15:14:19 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10302 14/06/05 15:14:19 INFO mapred.JobClient: File Output Format Counters 14/06/05 15:14:19 INFO mapred.JobClient: Bytes Written=86 14/06/05 15:14:19 INFO mapred.JobClient: FileSystemCounters 14/06/05 15:14:19 INFO mapred.JobClient: FILE_BYTES_READ=6 14/06/05 15:14:19 INFO mapred.JobClient: HDFS_BYTES_READ=1135 14/06/05 15:14:19 INFO mapred.JobClient: FILE_BYTES_WRITTEN=157112 14/06/05 15:14:19 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=86 14/06/05 15:14:19 INFO mapred.JobClient: File Input Format Counters 14/06/05 15:14:19 INFO mapred.JobClient: Bytes Read=0 14/06/05 15:14:19 INFO mapred.JobClient: Map-Reduce Framework 14/06/05 15:14:19 INFO mapred.JobClient: Map output 
materialized bytes=6 14/06/05 15:14:19 INFO mapred.JobClient: Map input records=0 14/06/05 15:14:19 INFO mapred.JobClient: Reduce shuffle bytes=6 14/06/05 15:14:19 INFO mapred.JobClient: Spilled Records=0 14/06/05 15:14:19 INFO mapred.JobClient: Map output bytes=0 14/06/05 15:14:19 INFO mapred.JobClient: Total committed heap usage (bytes)=216530944 14/06/05 15:14:19 INFO mapred.JobClient: CPU time spent (ms)=2450 14/06/05 15:14:19 INFO mapred.JobClient: Combine input records=0 14/06/05 15:14:19 INFO mapred.JobClient: SPLIT_RAW_BYTES=1135 14/06/05 15:14:19 INFO mapred.JobClient: Reduce input records=0 14/06/05 15:14:19 INFO mapred.JobClient: Reduce input groups=0 14/06/05 15:14:19 INFO mapred.JobClient: Combine output records=0 14/06/05 15:14:19 INFO mapred.JobClient: Physical memory (bytes) snapshot=320630784 14/06/05 15:14:19 INFO mapred.JobClient: Reduce output records=0 14/06/05 15:14:19 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2254024704 14/06/05 15:14:19 INFO mapred.JobClient: Map output records=0 14/06/05 15:14:19 INFO crawl.WebTableReader: Statistics for WebTable: 14/06/05 15:14:19 INFO crawl.WebTableReader: jobs: {db_stats-job_201406051410_0019={jobID=job_201406051410_0019, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Job Counters ={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=36697, FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0, TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=10302}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=216530944, CPU_MILLISECONDS=2450, SPLIT_RAW_BYTES=1135, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=320630784, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=2254024704, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1135, FILE_BYTES_WRITTEN=157112, 
HDFS_BYTES_WRITTEN=86}, File Output Format Counters ={BYTES_WRITTEN=86}}}}
14/06/05 15:14:19 INFO crawl.WebTableReader: TOTAL urls: 0
14/06/05 15:14:19 INFO crawl.WebTableReader: WebTable statistics: done
14/06/05 15:14:19 INFO crawl.WebTableReader: jobs: {db_stats-job_201406051410_0019={jobID=job_201406051410_0019, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Job Counters ={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=36697, FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0, TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=10302}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=216530944, CPU_MILLISECONDS=2450, SPLIT_RAW_BYTES=1135, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=320630784, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=2254024704, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1135, FILE_BYTES_WRITTEN=157112, HDFS_BYTES_WRITTEN=86}, File Output Format Counters ={BYTES_WRITTEN=86}}}}
14/06/05 15:14:19 INFO crawl.WebTableReader: TOTAL urls: 0

--
Manikandan Saravanan
Architect - Technology
TheSocialPeople

--
Lewis
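
For anyone reading the console output above, the quickest signal is the "Map input records" counter that JobClient prints for each job: the inject job reports 1, while every later job reports 0. The following is only a rough sketch for pulling that counter out of a pasted log, assuming the Hadoop 1.x JobClient line format seen in this thread; the function name map_input_records and the sample string are made up for illustration and are not part of Nutch or Hadoop.

```python
import re

# Start of a log entry, e.g. "14/06/05 15:01:02 INFO " (timestamp plus level).
ENTRY = re.compile(r"\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (?:INFO|WARN|ERROR) ")


def map_input_records(log_text):
    """Return (job_id, map_input_records) pairs in order of appearance."""
    results = []
    current_job = None
    # Splitting on the entry prefix means the parser also copes with logs
    # whose line breaks were lost when pasted into an email.
    for entry in ENTRY.split(log_text):
        running = re.search(r"Running job: (job_\d+_\d+)", entry)
        if running:
            current_job = running.group(1)
            continue
        records = re.search(r"Map input records=(\d+)", entry)
        if records and current_job is not None:
            results.append((current_job, int(records.group(1))))
    return results


# Hypothetical excerpt built from entries quoted in this thread.
sample = (
    "14/06/05 15:00:44 INFO mapred.JobClient: Running job: job_201406051410_0011 "
    "14/06/05 15:01:02 INFO mapred.JobClient: Map input records=1 "
    "14/06/05 15:01:15 INFO mapred.JobClient: Running job: job_201406051410_0012 "
    "14/06/05 15:02:14 INFO mapred.JobClient: Map input records=0 "
)

for job_id, count in map_input_records(sample):
    print(job_id, count)
```

A job that logs zero map input records read nothing from the datastore, so the first job where the count drops to 0 (here the generator job, job_201406051410_0012) is where to start looking.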

