Hi all, I hope you can help, as I am becoming rather depressed with Nutch on
Windows.

Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from cygwin
site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0

I cannot stop Nutch from crawling parent directories. I have looked at other
threads, but none of the suggestions seem to work.

I have tried to include protocol-smb [jcifs] but Cygwin keeps prompting for
Java syntax corrections.

Below I have listed my configurations along with the command I type in
cygwin for jcifs:

CRAWL-URLFILTER
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip http:, ftp:, & mailto: urls; accept file: and smb: urls
-^(http|ftp|mailto):
+^(file|smb):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
# Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^file:///C:/Policies/

# skip everything else
-.
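
The first-match rule described in the comments above can be sketched in Python as follows. This is an illustration only, not Nutch code: the `accepts` helper is a hypothetical name, the suffix pattern is shortened, and the sketch assumes each pattern is searched for anywhere in the URL (as Python's `re.search` does).

```python
import re

# Rules in the same order as the filter file above: (sign, pattern).
# The suffix pattern is shortened here for readability.
RULES = [
    ("-", r"^(http|ftp|mailto):"),
    ("+", r"^(file|smb):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|zip|exe)$"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^file:///C:/Policies/"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Apply the rules top to bottom; the first pattern that matches decides."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no pattern matched, so the URL is ignored


if __name__ == "__main__":
    for url in ("file:///C:/Policies/", "file:///C:/",
                "smb://sql1/Sales/DATA/", "http://example.com/"):
        print(url, accepts(url))
```

Under this reading, `file:///C:/` is already accepted by the `+^(file|smb):` line, so the later `+^file:///C:/Policies/` line is never consulted for file: URLs.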

NUTCH-SITE

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->

<nutch-conf>

<property>
 <name>http.agent.name</name>
 <value>pascall</value>
 <description></description>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

<property>
  <name>file.crawl.parent</name>
  <value>false</value>
  <description>The crawler is not restricted to the directories that you
    specified in the Urls file but it is jumping into the parent directories as
    well. For your own crawlings you can change this behavior (set to false) the
    way that only directories beneath the directories that you specify get
    crawled.</description>
</property>

<property>
<name>plugin.includes</name> 
<value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property> 

</nutch-conf>
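
The crawl log later in this message reports `bad conf file: top-level element not <configuration>`. A quick way to see which root element a conf file actually has is a small Python check like the one below. It is only a sketch: `root_element_name` and `has_configuration_root` are hypothetical helper names, the expected name `configuration` is taken from the log message, and `SITE_CONF` is a trimmed-down stand-in for the nutch-site.xml shown above.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the nutch-site.xml shown above.
SITE_CONF = """<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>file.crawl.parent</name>
    <value>false</value>
  </property>
</nutch-conf>
"""


def root_element_name(xml_text):
    """Return the tag name of the document's top-level element."""
    return ET.fromstring(xml_text).tag


def has_configuration_root(xml_text):
    """True if the top-level element is the one named in the log message."""
    return root_element_name(xml_text) == "configuration"


if __name__ == "__main__":
    print(root_element_name(SITE_CONF))       # nutch-conf
    print(has_configuration_root(SITE_CONF))  # False
```

For the file above, the check reports `nutch-conf` rather than `configuration`, which matches the complaint in the log.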

CYGWIN

In Cygwin, I enter this command from C:\Sun\Java\jdk160\bin\:

java -Djava.protocol.handler.pkgs=jcifs

When I press return the cygwin shell displays a list of java commands as
though I am using incorrect syntax.

Dump of Crawl from Cygwin:

2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:16,171 INFO  crawl.Crawl - crawl started in: crawl
2007-05-24 14:04:16,171 INFO  crawl.Crawl - rootUrlDir = urls.txt
2007-05-24 14:04:16,171 INFO  crawl.Crawl - threads = 10
2007-05-24 14:04:16,171 INFO  crawl.Crawl - depth = 5
2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: starting
2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: urlDir: urls.txt
2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:16,953 INFO  plugin.PluginRepository - Plugins: looking in:
C:\nutch-0.9\plugins
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered Plugins:
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         the nutch core
extension points (nutch-extensionpoints)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         MSPowerPoint 
Parse
Plug-in (parse-mspowerpoint)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Basic Query 
Filter
(query-basic)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Basic Indexing
Filter (index-basic)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Html Parse 
Plug-in
(parse-html)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Pdf Parse 
Plug-in
(parse-pdf)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Site Query 
Filter
(query-site)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Jakarta POI - 
Java
API To Access Microsoft Format Files (lib-jakarta-poi)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Text Parse 
Plug-in
(parse-text)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         MSWord Parse
Plug-in (parse-msword)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         SMB Protocol
Plug-in (protocol-smb)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         MSExcel Parse
Plug-in (parse-msexcel)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         CyberNeko HTML
Parser (lib-nekohtml)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Log4j 
(lib-log4j)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         File Protocol
Plug-in (protocol-file)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         URL Query Filter
(query-url)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Parse MS 
Documents
Framework (lib-parsems)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
Extension-Points:
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Nutch Online 
Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         HTML Parse 
Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Nutch Query 
Filter
(org.apache.nutch.searcher.QueryFilter)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -         Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-05-24 14:04:17,875 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:18,375 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2007-05-24 14:04:19,281 INFO  crawl.Injector - Injector: done
2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: starting
2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: segment: crawl/segments/20070524140420
2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: filtering: false
2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: topN: 2147483647
2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:20,312 INFO  crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:20,609 INFO  plugin.PluginRepository - Plugins: looking in:
C:\nutch-0.9\plugins
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered Plugins:
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         the nutch core
extension points (nutch-extensionpoints)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         MSPowerPoint 
Parse
Plug-in (parse-mspowerpoint)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Basic Query 
Filter
(query-basic)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Basic Indexing
Filter (index-basic)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Html Parse 
Plug-in
(parse-html)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Pdf Parse 
Plug-in
(parse-pdf)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Site Query 
Filter
(query-site)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Jakarta POI - 
Java
API To Access Microsoft Format Files (lib-jakarta-poi)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Text Parse 
Plug-in
(parse-text)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         MSWord Parse
Plug-in (parse-msword)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         SMB Protocol
Plug-in (protocol-smb)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         MSExcel Parse
Plug-in (parse-msexcel)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         CyberNeko HTML
Parser (lib-nekohtml)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Log4j 
(lib-log4j)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         File Protocol
Plug-in (protocol-file)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         URL Query Filter
(query-url)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Parse MS 
Documents
Framework (lib-parsems)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
Extension-Points:
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Nutch Online 
Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         HTML Parse 
Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Nutch Query 
Filter
(org.apache.nutch.searcher.QueryFilter)
2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -         Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-05-24 14:04:20,796 WARN  crawl.PartitionUrlByHost - Malformed URL: 'smb://sql1/Sales/DATA/'
2007-05-24 14:04:20,843 INFO  plugin.PluginRepository - Plugins: looking in:
C:\nutch-0.9\plugins
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered Plugins:
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         the nutch core
extension points (nutch-extensionpoints)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         MSPowerPoint 
Parse
Plug-in (parse-mspowerpoint)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Basic Query 
Filter
(query-basic)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Basic Indexing
Filter (index-basic)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Html Parse 
Plug-in
(parse-html)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Pdf Parse 
Plug-in
(parse-pdf)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Site Query 
Filter
(query-site)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Jakarta POI - 
Java
API To Access Microsoft Format Files (lib-jakarta-poi)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Text Parse 
Plug-in
(parse-text)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         MSWord Parse
Plug-in (parse-msword)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         SMB Protocol
Plug-in (protocol-smb)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         MSExcel Parse
Plug-in (parse-msexcel)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         CyberNeko HTML
Parser (lib-nekohtml)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Log4j 
(lib-log4j)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         File Protocol
Plug-in (protocol-file)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         URL Query Filter
(query-url)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Parse MS 
Documents
Framework (lib-parsems)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
Extension-Points:
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Nutch Online 
Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         HTML Parse 
Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Nutch Query 
Filter
(org.apache.nutch.searcher.QueryFilter)
2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -         Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-05-24 14:04:21,578 INFO  crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:21,859 INFO  plugin.PluginRepository - Plugins: looking in:
C:\nutch-0.9\plugins
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered Plugins:
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         the nutch core
extension points (nutch-extensionpoints)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         MSPowerPoint 
Parse
Plug-in (parse-mspowerpoint)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Basic Query 
Filter
(query-basic)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Basic Indexing
Filter (index-basic)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Html Parse 
Plug-in
(parse-html)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Pdf Parse 
Plug-in
(parse-pdf)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Site Query 
Filter
(query-site)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Jakarta POI - 
Java
API To Access Microsoft Format Files (lib-jakarta-poi)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Text Parse 
Plug-in
(parse-text)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         MSWord Parse
Plug-in (parse-msword)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         SMB Protocol
Plug-in (protocol-smb)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         MSExcel Parse
Plug-in (parse-msexcel)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         CyberNeko HTML
Parser (lib-nekohtml)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Log4j 
(lib-log4j)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         File Protocol
Plug-in (protocol-file)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         URL Query Filter
(query-url)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Parse MS 
Documents
Framework (lib-parsems)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
Extension-Points:
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Nutch Online 
Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         HTML Parse 
Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Nutch Query 
Filter
(org.apache.nutch.searcher.QueryFilter)
2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -         Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL: 'smb://sql1/Sales/DATA/'
2007-05-24 14:04:22,843 INFO  crawl.Generator - Generator: done.
2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: starting
2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: segment: crawl/segments/20070524140420
2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:23,187 INFO  fetcher.Fetcher - Fetcher: threads: 10
2007-05-24 14:04:23,203 INFO  plugin.PluginRepository - Plugins: looking in:
C:\nutch-0.9\plugins
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered Plugins:
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         the nutch core
extension points (nutch-extensionpoints)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         MSPowerPoint 
Parse
Plug-in (parse-mspowerpoint)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Basic Query 
Filter
(query-basic)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Basic Indexing
Filter (index-basic)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Html Parse 
Plug-in
(parse-html)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Pdf Parse 
Plug-in
(parse-pdf)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Site Query 
Filter
(query-site)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Jakarta POI - 
Java
API To Access Microsoft Format Files (lib-jakarta-poi)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Text Parse 
Plug-in
(parse-text)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         MSWord Parse
Plug-in (parse-msword)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         SMB Protocol
Plug-in (protocol-smb)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         MSExcel Parse
Plug-in (parse-msexcel)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         CyberNeko HTML
Parser (lib-nekohtml)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Log4j 
(lib-log4j)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         File Protocol
Plug-in (protocol-file)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         URL Query Filter
(query-url)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Parse MS 
Documents
Framework (lib-parsems)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
Extension-Points:
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Nutch Online 
Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         HTML Parse 
Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Nutch Query 
Filter
(org.apache.nutch.searcher.QueryFilter)
2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -         Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetching smb://sql1/Sales/DATA/
2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetch of smb://sql1/Sales/DATA/ failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: unknown protocol: smb
2007-05-24 14:04:23,500 INFO  fetcher.Fetcher - fetching file:///C:/Policies/
2007-05-24 14:04:23,718 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2007-05-24 14:04:24,671 INFO  plugin.PluginRepository - Plugins: looking in:
C:\nutch-0.9\plugins
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered Plugins:
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         the nutch core
extension points (nutch-extensionpoints)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         MSPowerPoint 
Parse
Plug-in (parse-mspowerpoint)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Basic Query 
Filter
(query-basic)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Basic Indexing
Filter (index-basic)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Html Parse 
Plug-in
(parse-html)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Pdf Parse 
Plug-in
(parse-pdf)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Site Query 
Filter
(query-site)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Jakarta POI - 
Java
API To Access Microsoft Format Files (lib-jakarta-poi)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Text Parse 
Plug-in
(parse-text)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         MSWord Parse
Plug-in (parse-msword)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         SMB Protocol
Plug-in (protocol-smb)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         MSExcel Parse
Plug-in (parse-msexcel)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         CyberNeko HTML
Parser (lib-nekohtml)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Log4j 
(lib-log4j)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         File Protocol
Plug-in (protocol-file)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         URL Query Filter
(query-url)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Parse MS 
Documents
Framework (lib-parsems)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
Extension-Points:
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Nutch Online 
Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         HTML Parse 
Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Nutch Query 
Filter
(org.apache.nutch.searcher.QueryFilter)
2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -         Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-05-24 14:04:25,171 INFO  fetcher.Fetcher - Fetcher: done
2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: starting
2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: db: crawl/crawldb
2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: segments: [crawl/segments/20070524140420]
2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: additions allowed: true
2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL normalizing: true
2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL filtering: true
2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:25,203 INFO  crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:25,468 INFO  plugin.PluginRepository - Plugins: looking in: C:\nutch-0.9\plugins
2007-05-24 14:04:25,593 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]


Thank you for reading my post; I hope you can help.

Regards,

Oli