I can explain why the "Input path doesn't exist" error disappeared when you used Windows-like paths.
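In short, a Java that runs as a native Windows program will most likely not understand Cygwin mount paths such as /cygdrive/c/nutch, so such paths have to be turned into the C:/nutch form before they are passed to bin/nutch. Here is a minimal sketch of that conversion. The function name to_windows_path is hypothetical and not part of Nutch or Cygwin. It prefers Cygwin's own cygpath utility when it is installed, and otherwise falls back to a plain string rewrite that assumes the default /cygdrive/<letter> prefix.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch or Cygwin): turn a Cygwin-style
# path like /cygdrive/c/nutch into the mixed Windows form C:/nutch.
to_windows_path() {
  wp_path=$1

  # On a real Cygwin installation, cygpath -m prints the Windows path
  # with forward slashes, so use it when it is available.
  if command -v cygpath >/dev/null 2>&1; then
    cygpath -m "$wp_path"
    return
  fi

  # Fallback: assumes the default /cygdrive/<letter> mount prefix.
  case "$wp_path" in
    /cygdrive/[A-Za-z]|/cygdrive/[A-Za-z]/*)
      wp_rest=${wp_path#/cygdrive/}    # e.g. "c/nutch"
      wp_drive=${wp_rest%%/*}          # e.g. "c"
      wp_rest=${wp_rest#"$wp_drive"}   # e.g. "/nutch" (may be empty)
      wp_drive=$(printf '%s' "$wp_drive" | tr 'a-z' 'A-Z')
      printf '%s:%s\n' "$wp_drive" "${wp_rest:-/}"
      ;;
    *)
      # Anything else is passed through unchanged.
      printf '%s\n' "$wp_path"
      ;;
  esac
}

# Example: print the converted form of a path from the transcript.
to_windows_path /cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt
# prints C:/nutch-2007-07-26_04-01-20/content/urls.txt
```

The converted value is what you would then pass as the URL file or the -dir argument, which matches the paths you typed by hand in your second attempt.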
I have not used Cygwin in a long time, but I guess your Cygwin distribution does not ship its own Java. So when you execute bin/nutch, it is probably using the java.exe found in one of the folders on your Windows PATH. That Java does not know what /cygdrive/c/nutch is, because it is not running as part of Cygwin and does not go through the Cygwin emulation layer. It needs the Windows-like path C:/nutch, since it runs as a native Windows program. My description might be technically a little inaccurate, but I hope I have conveyed the basic idea.

Regards,
Susam Pal
http://susam.in/

On 7/27/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> I really wonder if this is some kind of nutch + cygwin error. Check this
> out. I change the paths to windows-like paths (not the cygwin mounted paths
> -- maybe the /cygwin/c mount point is the problem). Note that I use forward
> slashes in the windows-like paths: I no longer get the "Input path doesnt
> exist" error, though I still get a failure.
>
> [EMAIL PROTECTED] /cygdrive/c/nutch-2007-07-26_04-01-20/logs
> $ nutch crawl C:/nutch-2007-07-26_04-01-20/content/urls.txt -dir
> c:/nutch-2007-07-26_04-01-20/content/sf911truth -depth
> 3 -topN 200
> crawl started in: c:/nutch-2007-07-26_04-01-20/content/sf911truth
> rootUrlDir = C:/nutch-2007-07-26_04-01-20/content/urls.txt
> threads = 10
> depth = 3
> topN = 200
> Injector: starting
> Injector: crawlDb: c:/nutch-2007-07-26_04-01-20/content/sf911truth/crawldb
> Injector: urlDir: C:/nutch-2007-07-26_04-01-20/content/urls.txt
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment:
> c:/nutch-2007-07-26_04-01-20/content/sf911truth/segments/20070727003008
> Generator: filtering: false
> Generator: topN: 200
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness. > Generator: done. > Fetcher: starting > Fetcher: segment: > c:/nutch-2007-07-26_04-01-20/content/sf911truth/segments/20070727003008 > Fetcher: threads: 10 > fetching http://www.sf911truth.org/ > Exception in thread "main" java.io.IOException: Job failed! > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) > at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:499) > at org.apache.nutch.crawl.Crawl.main(Crawl.java:124) > > [EMAIL PROTECTED] /cygdrive/c/nutch-2007-07-26_04-01-20/logs > $ cat hadoop.log > 2007-07-27 00:30:03,171 INFO crawl.Crawl - crawl started in: > c:/nutch-2007-07-26_04-01-20/content/sf911truth > 2007-07-27 00:30:03,187 INFO crawl.Crawl - rootUrlDir = > C:/nutch-2007-07-26_04-01-20/content/urls.txt > 2007-07-27 00:30:03,187 INFO crawl.Crawl - threads = 10 > 2007-07-27 00:30:03,187 INFO crawl.Crawl - depth = 3 > 2007-07-27 00:30:03,187 INFO crawl.Crawl - topN = 200 > 2007-07-27 00:30:03,281 INFO crawl.Injector - Injector: starting > 2007-07-27 00:30:03,281 INFO crawl.Injector - Injector: crawlDb: > c:/nutch-2007-07-26_04-01-20/content/sf911truth/crawld > b > 2007-07-27 00:30:03,281 INFO crawl.Injector - Injector: urlDir: > C:/nutch-2007-07-26_04-01-20/content/urls.txt > 2007-07-27 00:30:03,296 INFO crawl.Injector - Injector: Converting injected > urls to crawl db entries. 
> 2007-07-27 00:30:04,031 INFO plugin.PluginRepository - Plugins: looking in: > C:\nutch-2007-07-26_04-01-20\plugins > 2007-07-27 00:30:04,296 INFO plugin.PluginRepository - Plugin > Auto-activation mode: [true] > 2007-07-27 00:30:04,296 INFO plugin.PluginRepository - Registered Plugins: > 2007-07-27 00:30:04,296 INFO plugin.PluginRepository - CyberNeko > HTML Parser (lib-nekohtml) > 2007-07-27 00:30:04,296 INFO plugin.PluginRepository - Site Query > Filter (query-site) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Basic URL > Normalizer (urlnormalizer-basic) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Html Parse > Plug-in (parse-html) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Pass-through > URL Normalizer (urlnormalizer-pass) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Regex URL > Filter Framework (lib-regex-filter) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Feed > Parse/Index/Query Plug-in (feed) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Basic > Indexing Filter (index-basic) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Basic > Summarizer Plug-in (summary-basic) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Text Parse > Plug-in (parse-text) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - JavaScript > Parser (parse-js) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Basic Query > Filter (query-basic) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Regex URL > Filter (urlfilter-regex) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - HTTP > Framework (lib-http) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - XML Libraries > (lib-xml) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - URL Query > Filter (query-url) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Regex URL > Normalizer (urlnormalizer-regex) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Http Protocol > Plug-in (protocol-http) > 
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - the nutch > core extension points (nutch-extensionpoints) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - OPIC Scoring > Plug-in (scoring-opic) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Registered > Extension-Points: > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch > Summarizer (org.apache.nutch.searcher.Summarizer) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Scoring > (org.apache.nutch.scoring.ScoringFilter) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch > Protocol (org.apache.nutch.protocol.Protocol) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch URL > Normalizer (org.apache.nutch.net.URLNormalizer > ) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch URL > Filter (org.apache.nutch.net.URLFilter) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - HTML Parse > Filter (org.apache.nutch.parse.HtmlParseFilte > r) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Online > Search Results Clustering Plugin (org.apach > e.nutch.clustering.OnlineClusterer) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch > Indexing Filter (org.apache.nutch.indexer.Indexing > Filter) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Content > Parser (org.apache.nutch.parse.Parser) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Ontology > Model Loader (org.apache.nutch.ontology.Ontolog > y) > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch > Analysis (org.apache.nutch.analysis.NutchAnalyzer) > > 2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Query > Filter (org.apache.nutch.searcher.QueryFilte > r) > 2007-07-27 00:30:04,375 WARN regex.RegexURLNormalizer - can't find rules for > scope 'inject', using default > 2007-07-27 00:30:06,046 INFO crawl.Injector - Injector: Merging injected > urls into crawl db. 
> 2007-07-27 00:30:06,640 WARN util.NativeCodeLoader - Unable to load > native-hadoop library for your platform... using bu > iltin-java classes where applicable > 2007-07-27 00:30:07,500 INFO crawl.Injector - Injector: done > 2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: Selecting > best-scoring urls due for fetch. > 2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: starting > 2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: segment: > c:/nutch-2007-07-26_04-01-20/content/sf911truth/segm > ents/20070727003008 > 2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: filtering: false > 2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: topN: 200 > 2007-07-27 00:30:08,531 INFO crawl.Generator - Generator: jobtracker is > 'local', generating exactly one partition. > 2007-07-27 00:30:08,984 INFO plugin.PluginRepository - Plugins: looking in: > C:\nutch-2007-07-26_04-01-20\plugins > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Plugin > Auto-activation mode: [true] > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Registered Plugins: > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - CyberNeko > HTML Parser (lib-nekohtml) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Site Query > Filter (query-site) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Basic URL > Normalizer (urlnormalizer-basic) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Html Parse > Plug-in (parse-html) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Pass-through > URL Normalizer (urlnormalizer-pass) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Regex URL > Filter Framework (lib-regex-filter) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Feed > Parse/Index/Query Plug-in (feed) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Basic > Indexing Filter (index-basic) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Basic > Summarizer Plug-in (summary-basic) > 2007-07-27 
00:30:09,187 INFO plugin.PluginRepository - Text Parse > Plug-in (parse-text) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - JavaScript > Parser (parse-js) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Basic Query > Filter (query-basic) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Regex URL > Filter (urlfilter-regex) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - HTTP > Framework (lib-http) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - XML Libraries > (lib-xml) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - URL Query > Filter (query-url) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Regex URL > Normalizer (urlnormalizer-regex) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Http Protocol > Plug-in (protocol-http) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - the nutch > core extension points (nutch-extensionpoints) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - OPIC Scoring > Plug-in (scoring-opic) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Registered > Extension-Points: > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch > Summarizer (org.apache.nutch.searcher.Summarizer) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Scoring > (org.apache.nutch.scoring.ScoringFilter) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch > Protocol (org.apache.nutch.protocol.Protocol) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch URL > Normalizer (org.apache.nutch.net.URLNormalizer > ) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch URL > Filter (org.apache.nutch.net.URLFilter) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - HTML Parse > Filter (org.apache.nutch.parse.HtmlParseFilte > r) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Online > Search Results Clustering Plugin (org.apach > e.nutch.clustering.OnlineClusterer) > 2007-07-27 00:30:09,187 INFO 
plugin.PluginRepository - Nutch > Indexing Filter (org.apache.nutch.indexer.Indexing > Filter) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Content > Parser (org.apache.nutch.parse.Parser) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Ontology > Model Loader (org.apache.nutch.ontology.Ontolog > y) > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch > Analysis (org.apache.nutch.analysis.NutchAnalyzer) > > 2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Query > Filter (org.apache.nutch.searcher.QueryFilte > r) > 2007-07-27 00:30:09,218 INFO crawl.FetchScheduleFactory - Using > FetchSchedule impl: org.apache.nutch.crawl.DefaultFetch > Schedule > 2007-07-27 00:30:09,218 INFO crawl.AbstractFetchSchedule - > defaultInterval=2592000.0 > 2007-07-27 00:30:09,218 INFO crawl.AbstractFetchSchedule - > maxInterval=7776000.0 > 2007-07-27 00:30:09,234 WARN regex.RegexURLNormalizer - can't find rules for > scope 'partition', using default > 2007-07-27 00:30:09,296 INFO plugin.PluginRepository - Plugins: looking in: > C:\nutch-2007-07-26_04-01-20\plugins > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Plugin > Auto-activation mode: [true] > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Registered Plugins: > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - CyberNeko > HTML Parser (lib-nekohtml) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Site Query > Filter (query-site) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Basic URL > Normalizer (urlnormalizer-basic) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Html Parse > Plug-in (parse-html) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Pass-through > URL Normalizer (urlnormalizer-pass) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Regex URL > Filter Framework (lib-regex-filter) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Feed > Parse/Index/Query Plug-in (feed) > 2007-07-27 00:30:09,468 INFO 
plugin.PluginRepository - Basic > Indexing Filter (index-basic) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Basic > Summarizer Plug-in (summary-basic) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Text Parse > Plug-in (parse-text) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - JavaScript > Parser (parse-js) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Basic Query > Filter (query-basic) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Regex URL > Filter (urlfilter-regex) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - HTTP > Framework (lib-http) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - XML Libraries > (lib-xml) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - URL Query > Filter (query-url) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Regex URL > Normalizer (urlnormalizer-regex) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Http Protocol > Plug-in (protocol-http) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - the nutch > core extension points (nutch-extensionpoints) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - OPIC Scoring > Plug-in (scoring-opic) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Registered > Extension-Points: > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch > Summarizer (org.apache.nutch.searcher.Summarizer) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Scoring > (org.apache.nutch.scoring.ScoringFilter) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch > Protocol (org.apache.nutch.protocol.Protocol) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch URL > Normalizer (org.apache.nutch.net.URLNormalizer > ) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch URL > Filter (org.apache.nutch.net.URLFilter) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - HTML Parse > Filter (org.apache.nutch.parse.HtmlParseFilte > r) > 2007-07-27 00:30:09,468 
INFO plugin.PluginRepository - Nutch Online > Search Results Clustering Plugin (org.apach > e.nutch.clustering.OnlineClusterer) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch > Indexing Filter (org.apache.nutch.indexer.Indexing > Filter) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Content > Parser (org.apache.nutch.parse.Parser) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Ontology > Model Loader (org.apache.nutch.ontology.Ontolog > y) > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch > Analysis (org.apache.nutch.analysis.NutchAnalyzer) > > 2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Query > Filter (org.apache.nutch.searcher.QueryFilte > r) > 2007-07-27 00:30:09,500 INFO crawl.FetchScheduleFactory - Using > FetchSchedule impl: org.apache.nutch.crawl.DefaultFetch > Schedule > 2007-07-27 00:30:09,500 INFO crawl.AbstractFetchSchedule - > defaultInterval=2592000.0 > 2007-07-27 00:30:09,500 INFO crawl.AbstractFetchSchedule - > maxInterval=7776000.0 > 2007-07-27 00:30:10,187 INFO crawl.Generator - Generator: Partitioning > selected urls by host, for politeness. 
> 2007-07-27 00:30:10,687 INFO plugin.PluginRepository - Plugins: looking in: > C:\nutch-2007-07-26_04-01-20\plugins > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Plugin > Auto-activation mode: [true] > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Registered Plugins: > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - CyberNeko > HTML Parser (lib-nekohtml) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Site Query > Filter (query-site) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Basic URL > Normalizer (urlnormalizer-basic) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Html Parse > Plug-in (parse-html) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Pass-through > URL Normalizer (urlnormalizer-pass) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Regex URL > Filter Framework (lib-regex-filter) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Feed > Parse/Index/Query Plug-in (feed) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Basic > Indexing Filter (index-basic) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Basic > Summarizer Plug-in (summary-basic) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Text Parse > Plug-in (parse-text) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - JavaScript > Parser (parse-js) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Basic Query > Filter (query-basic) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Regex URL > Filter (urlfilter-regex) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - HTTP > Framework (lib-http) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - XML Libraries > (lib-xml) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - URL Query > Filter (query-url) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Regex URL > Normalizer (urlnormalizer-regex) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Http Protocol > Plug-in (protocol-http) > 
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - the nutch > core extension points (nutch-extensionpoints) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - OPIC Scoring > Plug-in (scoring-opic) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Registered > Extension-Points: > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch > Summarizer (org.apache.nutch.searcher.Summarizer) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch Scoring > (org.apache.nutch.scoring.ScoringFilter) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch > Protocol (org.apache.nutch.protocol.Protocol) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch URL > Normalizer (org.apache.nutch.net.URLNormalizer > ) > 2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch URL > Filter (org.apache.nutch.net.URLFilter) > 2007-07-27 00:30:10,875 INFO plugin.PluginRepository - HTML Parse > Filter (org.apache.nutch.parse.HtmlParseFilte > r) > 2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch Online > Search Results Clustering Plugin (org.apach > e.nutch.clustering.OnlineClusterer) > 2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch > Indexing Filter (org.apache.nutch.indexer.Indexing > Filter) > 2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch Content > Parser (org.apache.nutch.parse.Parser) > 2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Ontology > Model Loader (org.apache.nutch.ontology.Ontolog > y) > 2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch > Analysis (org.apache.nutch.analysis.NutchAnalyzer) > > 2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch Query > Filter (org.apache.nutch.searcher.QueryFilte > r) > 2007-07-27 00:30:10,890 WARN regex.RegexURLNormalizer - can't find rules for > scope 'partition', using default > 2007-07-27 00:30:11,625 INFO crawl.Generator - Generator: done. 
> 2007-07-27 00:30:11,625 INFO fetcher.Fetcher - Fetcher: starting > 2007-07-27 00:30:11,625 INFO fetcher.Fetcher - Fetcher: segment: > c:/nutch-2007-07-26_04-01-20/content/sf911truth/segmen > ts/20070727003008 > 2007-07-27 00:30:12,078 INFO fetcher.Fetcher - Fetcher: threads: 10 > 2007-07-27 00:30:12,093 INFO plugin.PluginRepository - Plugins: looking in: > C:\nutch-2007-07-26_04-01-20\plugins > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Plugin > Auto-activation mode: [true] > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Registered Plugins: > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - CyberNeko > HTML Parser (lib-nekohtml) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Site Query > Filter (query-site) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Basic URL > Normalizer (urlnormalizer-basic) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Html Parse > Plug-in (parse-html) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Pass-through > URL Normalizer (urlnormalizer-pass) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Regex URL > Filter Framework (lib-regex-filter) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Feed > Parse/Index/Query Plug-in (feed) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Basic > Indexing Filter (index-basic) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Basic > Summarizer Plug-in (summary-basic) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Text Parse > Plug-in (parse-text) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - JavaScript > Parser (parse-js) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Basic Query > Filter (query-basic) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Regex URL > Filter (urlfilter-regex) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - HTTP > Framework (lib-http) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - XML Libraries > (lib-xml) > 2007-07-27 
00:30:12,218 INFO plugin.PluginRepository - URL Query > Filter (query-url) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Regex URL > Normalizer (urlnormalizer-regex) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Http Protocol > Plug-in (protocol-http) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - the nutch > core extension points (nutch-extensionpoints) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - OPIC Scoring > Plug-in (scoring-opic) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Registered > Extension-Points: > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch > Summarizer (org.apache.nutch.searcher.Summarizer) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch Scoring > (org.apache.nutch.scoring.ScoringFilter) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch > Protocol (org.apache.nutch.protocol.Protocol) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch URL > Normalizer (org.apache.nutch.net.URLNormalizer > ) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch URL > Filter (org.apache.nutch.net.URLFilter) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - HTML Parse > Filter (org.apache.nutch.parse.HtmlParseFilte > r) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch Online > Search Results Clustering Plugin (org.apach > e.nutch.clustering.OnlineClusterer) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch > Indexing Filter (org.apache.nutch.indexer.Indexing > Filter) > 2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch Content > Parser (org.apache.nutch.parse.Parser) > 2007-07-27 00:30:12,234 INFO plugin.PluginRepository - Ontology > Model Loader (org.apache.nutch.ontology.Ontolog > y) > 2007-07-27 00:30:12,234 INFO plugin.PluginRepository - Nutch > Analysis (org.apache.nutch.analysis.NutchAnalyzer) > > 2007-07-27 00:30:12,234 INFO plugin.PluginRepository - Nutch Query > Filter 
(org.apache.nutch.searcher.QueryFilte > r) > 2007-07-27 00:30:12,265 INFO fetcher.Fetcher - fetching > http://www.sf911truth.org/ > 2007-07-27 00:30:12,312 FATAL api.RobotRulesParser - Agent we advertise > (microlith-nutch) not listed first in 'http.robo > ts.agents' property! > 2007-07-27 00:30:12,312 INFO http.Http - http.proxy.host = null > 2007-07-27 00:30:12,312 INFO http.Http - http.proxy.port = 8080 > 2007-07-27 00:30:12,312 INFO http.Http - http.timeout = 10000 > 2007-07-27 00:30:12,312 INFO http.Http - http.content.limit = 65536 > 2007-07-27 00:30:12,312 INFO http.Http - http.agent = > microlith-nutch/Nutch-1.0-dev (crawler nutch-2007-07-26_04-01-20; > http://hopoo.dyndns.org; kai(underscore)testing(att)yahoo(dotcom)) > 2007-07-27 00:30:12,312 INFO http.Http - protocol.plugin.check.blocking = > true > 2007-07-27 00:30:12,312 INFO http.Http - protocol.plugin.check.robots = true > 2007-07-27 00:30:12,312 INFO http.Http - fetcher.server.delay = 3000 > 2007-07-27 00:30:12,312 INFO http.Http - http.max.delays = 100 > 2007-07-27 00:30:13,578 WARN regex.RegexURLNormalizer - can't find rules for > scope 'outlink', using default > 2007-07-27 00:30:13,640 INFO crawl.SignatureFactory - Using Signature impl: > org.apache.nutch.crawl.MD5Signature > 2007-07-27 00:30:14,406 INFO plugin.PluginRepository - Plugins: looking in: > C:\nutch-2007-07-26_04-01-20\plugins > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Plugin > Auto-activation mode: [true] > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Registered Plugins: > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - CyberNeko > HTML Parser (lib-nekohtml) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Site Query > Filter (query-site) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Basic URL > Normalizer (urlnormalizer-basic) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Html Parse > Plug-in (parse-html) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - 
Pass-through > URL Normalizer (urlnormalizer-pass) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Regex URL > Filter Framework (lib-regex-filter) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Feed > Parse/Index/Query Plug-in (feed) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Basic > Indexing Filter (index-basic) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Basic > Summarizer Plug-in (summary-basic) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Text Parse > Plug-in (parse-text) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - JavaScript > Parser (parse-js) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Basic Query > Filter (query-basic) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Regex URL > Filter (urlfilter-regex) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - HTTP > Framework (lib-http) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - XML Libraries > (lib-xml) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - URL Query > Filter (query-url) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Regex URL > Normalizer (urlnormalizer-regex) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Http Protocol > Plug-in (protocol-http) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - the nutch > core extension points (nutch-extensionpoints) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - OPIC Scoring > Plug-in (scoring-opic) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Registered > Extension-Points: > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch > Summarizer (org.apache.nutch.searcher.Summarizer) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Scoring > (org.apache.nutch.scoring.ScoringFilter) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch > Protocol (org.apache.nutch.protocol.Protocol) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch URL > Normalizer 
(org.apache.nutch.net.URLNormalizer > ) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch URL > Filter (org.apache.nutch.net.URLFilter) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - HTML Parse > Filter (org.apache.nutch.parse.HtmlParseFilte > r) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Online > Search Results Clustering Plugin (org.apach > e.nutch.clustering.OnlineClusterer) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch > Indexing Filter (org.apache.nutch.indexer.Indexing > Filter) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Content > Parser (org.apache.nutch.parse.Parser) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Ontology > Model Loader (org.apache.nutch.ontology.Ontolog > y) > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch > Analysis (org.apache.nutch.analysis.NutchAnalyzer) > > 2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Query > Filter (org.apache.nutch.searcher.QueryFilte > r) > 2007-07-27 00:30:14,718 WARN mapred.LocalJobRunner - job_8r2j8 > java.lang.IllegalArgumentException: Illegal Capacity: -1 > at java.util.ArrayList.<init>(ArrayList.java:111) > at > org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:149) > at > org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:94) > at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:311) > at > org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:41) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:326) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
