I wonder whether this is some kind of Nutch + Cygwin error. Check this out.
I changed the paths to Windows-style paths instead of the Cygwin-mounted paths
(maybe the /cygdrive/c mount point is the problem). Note that I use forward
slashes in the Windows-style paths. I no longer get the "Input path doesnt
exist" error, though the crawl still fails.
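
For anyone who wants to script the same path change, here is a minimal sketch of the conversion I did by hand. The helper name to_mixed_path is made up for illustration and is not part of Nutch or Cygwin; as far as I know, Cygwin's own `cygpath -m` produces the same forward-slash form.

```python
import re

# Matches Cygwin drive-mount paths such as /cygdrive/c/some/dir.
_CYGDRIVE = re.compile(r"^/cygdrive/([A-Za-z])(/.*)?$")


def to_mixed_path(path):
    """Return a Windows path with forward slashes (C:/...) for a /cygdrive path.

    Paths that do not start with /cygdrive/<letter> are returned unchanged.
    Hypothetical helper, written only to illustrate the conversion.
    """
    match = _CYGDRIVE.match(path)
    if match is None:
        return path
    drive = match.group(1)
    rest = match.group(2) or "/"  # bare "/cygdrive/c" maps to the drive root
    return f"{drive.upper()}:{rest}"


print(to_mixed_path("/cygdrive/c/nutch-2007-07-26_04-01-20/content/urls.txt"))
```

Only the drive prefix changes; the rest of the path keeps its forward slashes.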
[EMAIL PROTECTED] /cygdrive/c/nutch-2007-07-26_04-01-20/logs
$ nutch crawl C:/nutch-2007-07-26_04-01-20/content/urls.txt -dir c:/nutch-2007-07-26_04-01-20/content/sf911truth -depth 3 -topN 200
crawl started in: c:/nutch-2007-07-26_04-01-20/content/sf911truth
rootUrlDir = C:/nutch-2007-07-26_04-01-20/content/urls.txt
threads = 10
depth = 3
topN = 200
Injector: starting
Injector: crawlDb: c:/nutch-2007-07-26_04-01-20/content/sf911truth/crawldb
Injector: urlDir: C:/nutch-2007-07-26_04-01-20/content/urls.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: c:/nutch-2007-07-26_04-01-20/content/sf911truth/segments/20070727003008
Generator: filtering: false
Generator: topN: 200
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: c:/nutch-2007-07-26_04-01-20/content/sf911truth/segments/20070727003008
Fetcher: threads: 10
fetching http://www.sf911truth.org/
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:499)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
[EMAIL PROTECTED] /cygdrive/c/nutch-2007-07-26_04-01-20/logs
$ cat hadoop.log
2007-07-27 00:30:03,171 INFO crawl.Crawl - crawl started in:
c:/nutch-2007-07-26_04-01-20/content/sf911truth
2007-07-27 00:30:03,187 INFO crawl.Crawl - rootUrlDir =
C:/nutch-2007-07-26_04-01-20/content/urls.txt
2007-07-27 00:30:03,187 INFO crawl.Crawl - threads = 10
2007-07-27 00:30:03,187 INFO crawl.Crawl - depth = 3
2007-07-27 00:30:03,187 INFO crawl.Crawl - topN = 200
2007-07-27 00:30:03,281 INFO crawl.Injector - Injector: starting
2007-07-27 00:30:03,281 INFO crawl.Injector - Injector: crawlDb:
c:/nutch-2007-07-26_04-01-20/content/sf911truth/crawldb
2007-07-27 00:30:03,281 INFO crawl.Injector - Injector: urlDir:
C:/nutch-2007-07-26_04-01-20/content/urls.txt
2007-07-27 00:30:03,296 INFO crawl.Injector - Injector: Converting injected
urls to crawl db entries.
2007-07-27 00:30:04,031 INFO plugin.PluginRepository - Plugins: looking in:
C:\nutch-2007-07-26_04-01-20\plugins
2007-07-27 00:30:04,296 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2007-07-27 00:30:04,296 INFO plugin.PluginRepository - Registered Plugins:
2007-07-27 00:30:04,296 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-07-27 00:30:04,296 INFO plugin.PluginRepository - Site Query
Filter (query-site)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Pass-through
URL Normalizer (urlnormalizer-pass)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Feed
Parse/Index/Query Plug-in (feed)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Text Parse
Plug-in (parse-text)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - JavaScript
Parser (parse-js)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Basic Query
Filter (query-basic)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - XML Libraries
(lib-xml)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - URL Query
Filter (query-url)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Registered
Extension-Points:
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Online
Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-07-27 00:30:04,312 INFO plugin.PluginRepository - Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2007-07-27 00:30:04,375 WARN regex.RegexURLNormalizer - can't find rules for
scope 'inject', using default
2007-07-27 00:30:06,046 INFO crawl.Injector - Injector: Merging injected urls
into crawl db.
2007-07-27 00:30:06,640 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where applicable
2007-07-27 00:30:07,500 INFO crawl.Injector - Injector: done
2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: starting
2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: segment:
c:/nutch-2007-07-26_04-01-20/content/sf911truth/segments/20070727003008
2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: filtering: false
2007-07-27 00:30:08,500 INFO crawl.Generator - Generator: topN: 200
2007-07-27 00:30:08,531 INFO crawl.Generator - Generator: jobtracker is
'local', generating exactly one partition.
2007-07-27 00:30:08,984 INFO plugin.PluginRepository - Plugins: looking in:
C:\nutch-2007-07-26_04-01-20\plugins
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Registered Plugins:
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Site Query
Filter (query-site)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Pass-through
URL Normalizer (urlnormalizer-pass)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Feed
Parse/Index/Query Plug-in (feed)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Text Parse
Plug-in (parse-text)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - JavaScript
Parser (parse-js)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Basic Query
Filter (query-basic)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - XML Libraries
(lib-xml)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - URL Query
Filter (query-url)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Registered
Extension-Points:
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Online
Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-07-27 00:30:09,187 INFO plugin.PluginRepository - Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2007-07-27 00:30:09,218 INFO crawl.FetchScheduleFactory - Using FetchSchedule
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2007-07-27 00:30:09,218 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000.0
2007-07-27 00:30:09,218 INFO crawl.AbstractFetchSchedule -
maxInterval=7776000.0
2007-07-27 00:30:09,234 WARN regex.RegexURLNormalizer - can't find rules for
scope 'partition', using default
2007-07-27 00:30:09,296 INFO plugin.PluginRepository - Plugins: looking in:
C:\nutch-2007-07-26_04-01-20\plugins
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Registered Plugins:
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Site Query
Filter (query-site)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Pass-through
URL Normalizer (urlnormalizer-pass)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Feed
Parse/Index/Query Plug-in (feed)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Text Parse
Plug-in (parse-text)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - JavaScript
Parser (parse-js)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Basic Query
Filter (query-basic)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - XML Libraries
(lib-xml)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - URL Query
Filter (query-url)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Registered
Extension-Points:
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Online
Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-07-27 00:30:09,468 INFO plugin.PluginRepository - Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2007-07-27 00:30:09,500 INFO crawl.FetchScheduleFactory - Using FetchSchedule
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2007-07-27 00:30:09,500 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000.0
2007-07-27 00:30:09,500 INFO crawl.AbstractFetchSchedule -
maxInterval=7776000.0
2007-07-27 00:30:10,187 INFO crawl.Generator - Generator: Partitioning
selected urls by host, for politeness.
2007-07-27 00:30:10,687 INFO plugin.PluginRepository - Plugins: looking in:
C:\nutch-2007-07-26_04-01-20\plugins
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Registered Plugins:
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Site Query
Filter (query-site)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Pass-through
URL Normalizer (urlnormalizer-pass)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Feed
Parse/Index/Query Plug-in (feed)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Text Parse
Plug-in (parse-text)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - JavaScript
Parser (parse-js)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Basic Query
Filter (query-basic)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - XML Libraries
(lib-xml)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - URL Query
Filter (query-url)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Registered
Extension-Points:
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-07-27 00:30:10,859 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2007-07-27 00:30:10,875 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch Online
Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-07-27 00:30:10,875 INFO plugin.PluginRepository - Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2007-07-27 00:30:10,890 WARN regex.RegexURLNormalizer - can't find rules for
scope 'partition', using default
2007-07-27 00:30:11,625 INFO crawl.Generator - Generator: done.
2007-07-27 00:30:11,625 INFO fetcher.Fetcher - Fetcher: starting
2007-07-27 00:30:11,625 INFO fetcher.Fetcher - Fetcher: segment:
c:/nutch-2007-07-26_04-01-20/content/sf911truth/segments/20070727003008
2007-07-27 00:30:12,078 INFO fetcher.Fetcher - Fetcher: threads: 10
2007-07-27 00:30:12,093 INFO plugin.PluginRepository - Plugins: looking in:
C:\nutch-2007-07-26_04-01-20\plugins
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Registered Plugins:
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Site Query
Filter (query-site)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Pass-through
URL Normalizer (urlnormalizer-pass)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Feed
Parse/Index/Query Plug-in (feed)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Text Parse
Plug-in (parse-text)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - JavaScript
Parser (parse-js)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Basic Query
Filter (query-basic)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - XML Libraries
(lib-xml)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - URL Query
Filter (query-url)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Registered
Extension-Points:
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch Online
Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-07-27 00:30:12,218 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-07-27 00:30:12,234 INFO plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-07-27 00:30:12,234 INFO plugin.PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-07-27 00:30:12,234 INFO plugin.PluginRepository - Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2007-07-27 00:30:12,265 INFO fetcher.Fetcher - fetching
http://www.sf911truth.org/
2007-07-27 00:30:12,312 FATAL api.RobotRulesParser - Agent we advertise
(microlith-nutch) not listed first in 'http.robots.agents' property!
2007-07-27 00:30:12,312 INFO http.Http - http.proxy.host = null
2007-07-27 00:30:12,312 INFO http.Http - http.proxy.port = 8080
2007-07-27 00:30:12,312 INFO http.Http - http.timeout = 10000
2007-07-27 00:30:12,312 INFO http.Http - http.content.limit = 65536
2007-07-27 00:30:12,312 INFO http.Http - http.agent =
microlith-nutch/Nutch-1.0-dev (crawler nutch-2007-07-26_04-01-20;
http://hopoo.dyndns.org; kai(underscore)testing(att)yahoo(dotcom))
2007-07-27 00:30:12,312 INFO http.Http - protocol.plugin.check.blocking = true
2007-07-27 00:30:12,312 INFO http.Http - protocol.plugin.check.robots = true
2007-07-27 00:30:12,312 INFO http.Http - fetcher.server.delay = 3000
2007-07-27 00:30:12,312 INFO http.Http - http.max.delays = 100
2007-07-27 00:30:13,578 WARN regex.RegexURLNormalizer - can't find rules for
scope 'outlink', using default
2007-07-27 00:30:13,640 INFO crawl.SignatureFactory - Using Signature impl:
org.apache.nutch.crawl.MD5Signature
2007-07-27 00:30:14,406 INFO plugin.PluginRepository - Plugins: looking in:
C:\nutch-2007-07-26_04-01-20\plugins
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Registered Plugins:
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Site Query
Filter (query-site)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Pass-through
URL Normalizer (urlnormalizer-pass)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Feed
Parse/Index/Query Plug-in (feed)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Text Parse
Plug-in (parse-text)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - JavaScript
Parser (parse-js)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Basic Query
Filter (query-basic)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - XML Libraries
(lib-xml)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - URL Query
Filter (query-url)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Registered
Extension-Points:
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Online
Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-07-27 00:30:14,609 INFO plugin.PluginRepository - Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2007-07-27 00:30:14,718 WARN mapred.LocalJobRunner - job_8r2j8
java.lang.IllegalArgumentException: Illegal Capacity: -1
at java.util.ArrayList.<init>(ArrayList.java:111)
at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:149)
at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:94)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:311)
at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:41)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:326)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
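
Since the repeated plugin listings bury the interesting lines, here is a rough sketch of a filter that keeps only the WARN/ERROR/FATAL entries of hadoop.log, together with any continuation lines such as a stack trace. It assumes every entry starts with a "date time LEVEL" prefix like the ones above; the function name problem_entries is made up for this example.

```python
import re

# Assumed entry prefix, as seen in the log above:
# "YYYY-MM-DD HH:MM:SS,mmm LEVEL logger - message"
_ENTRY = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} +(\w+) ")


def problem_entries(lines, levels=("WARN", "ERROR", "FATAL")):
    """Yield each log entry at one of the given levels.

    Lines without a timestamp prefix (wrapped text, stack frames) are
    treated as continuations of the entry that precedes them.
    """
    entry, keep = [], False
    for line in lines:
        match = _ENTRY.match(line)
        if match:
            # A new entry starts: emit the previous one if it qualified.
            if keep and entry:
                yield "\n".join(entry)
            entry = [line.rstrip("\n")]
            keep = match.group(1) in levels
        elif entry:
            entry.append(line.rstrip("\n"))
    if keep and entry:
        yield "\n".join(entry)


sample = [
    "2007-07-27 00:30:12,265 INFO fetcher.Fetcher - fetching",
    "http://www.sf911truth.org/",
    "2007-07-27 00:30:14,718 WARN mapred.LocalJobRunner - job_8r2j8",
    "java.lang.IllegalArgumentException: Illegal Capacity: -1",
]
for item in problem_entries(sample):
    print(item)
```

To run it on the real file, pass an open file object instead of the sample list, for example `problem_entries(open("hadoop.log"))`. On the log above it should leave the RegexURLNormalizer and NativeCodeLoader warnings, the FATAL line about 'http.robots.agents', and the final LocalJobRunner exception.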
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general