Hi,
This is most likely a URL filter issue. Check every URL filter you have
enabled. Nutch also ships a test program for URL filtering
(org.apache.nutch.net.URLFilterChecker); try running your URL through it.
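
To make "check all URL filters" a bit more concrete, here is a rough sketch of how a rule file in the style of regex-urlfilter.txt is usually read: each rule is a '+' or '-' followed by a regular expression, the first rule that matches decides, and a URL that matches no rule is rejected. This is a Python approximation only; Nutch itself uses Java regular expressions, and the rule set and the names load_rules, accepts and SAMPLE_RULES are made up for illustration.

```python
import re

# Hypothetical rule set in the '+regex' / '-regex' style; not taken from any real config.
SAMPLE_RULES = r"""
# skip common static-resource suffixes
-\.(gif|jpg|png|css|js)$
# accept anything on root.cz or its subdomains
+^https?://([a-z0-9-]+\.)*root\.cz(/|$)
"""

def load_rules(text):
    """Parse rule lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First matching rule wins; a URL that matches nothing is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

if __name__ == "__main__":
    rules = load_rules(SAMPLE_RULES)
    for url in ("http://www.root.cz", "http://www.root.cz/logo.png", "http://example.org/"):
        print("+" if accepts(rules, url) else "-", url)
```

Because the first match decides, a broad '-' rule placed above a '+' rule silently drops URLs the later rule was meant to accept, so rule order is worth checking first.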
This is the indexchecker output for one URL. Is this URL filtered or not? I
don't know how to interpret the output.
ponto:(crawler)runtime/deploy>bin/nutch indexchecker http://www.root.cz
11/10/14 06:01:00 INFO indexer.IndexingFiltersChecker: fetching: http://www.root.cz
11/10/14 06:01:00 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-crawler/hadoop-unjar3406850446948112163/plugins
11/10/14 06:01:00 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Plugins:
11/10/14 06:01:00 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
11/10/14 06:01:00 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
11/10/14 06:01:00 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
11/10/14 06:01:00 INFO plugin.PluginRepository: Domain URL Filter (urlfilter-domain)
11/10/14 06:01:00 INFO plugin.PluginRepository: HTTP Framework (lib-http)
11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
11/10/14 06:01:00 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Extension-Points:
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository: HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
11/10/14 06:01:00 INFO http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: parsing: http://www.root.cz
11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: contentType: application/xhtml+xml
11/10/14 06:01:02 INFO conf.Configuration: found resource parse-plugins.xml at file:/tmp/hadoop-crawler/hadoop-unjar3406850446948112163/parse-plugins.xml
11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml