Hi,

I am working on the same issue as you. So far I can crawl file:///C:/* but
I am stuck on the smb part. It looks to me as if this plugin isn't working
properly and needs to be fixed for the newer version of Nutch.

The error I get differs a bit from yours. It is:

2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetching smb://mobidick/test/
2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetch of smb://mobidick/test/ failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb

I will dive into the plugin-smb and try to narrow down the problem. Maybe we
can work together to get a quick solution.



---SNIP---

# accept hosts in MY.DOMAIN.NAME
# Standard: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^file:///C:/Policies/ <<-- Why did you put this here? It has no effect,
because the +^(file|smb): line above already matches, so this rule is
never reached.
---SNIP ---
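
To illustrate the point about rule order, here is a minimal sketch of
first-match-wins filtering as described in the comments of the filter file.
It is not Nutch's actual filter code: the class FirstMatchFilter and its
methods are hypothetical names, and only a subset of the rules from the file
is included.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class FirstMatchFilter {

    /** One filter line: '+' (accept) or '-' (reject) followed by a regular expression. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    /** A subset of the rules from the quoted filter file, in the same order. */
    static List<Rule> rules() {
        return Arrays.asList(
            new Rule("-^(http|ftp|mailto):"),
            new Rule("+^(file|smb):"),
            new Rule("+^file:///C:/Policies/"),
            new Rule("-."));
    }

    /** Returns the index of the first rule whose pattern is found in the URL, or -1. */
    static int firstMatch(List<Rule> rules, String url) {
        for (int i = 0; i < rules.size(); i++) {
            if (rules.get(i).pattern.matcher(url).find()) {
                return i;
            }
        }
        return -1;
    }

    /** A URL is accepted only if its first matching rule is a '+' rule. */
    static boolean accepts(List<Rule> rules, String url) {
        int index = firstMatch(rules, url);
        return index >= 0 && rules.get(index).accept;
    }

    public static void main(String[] args) {
        List<Rule> rules = rules();
        String[] urls = {"file:///C:/Policies/", "file:///C:/", "http://example.com/"};
        for (String url : urls) {
            System.out.println(url + " -> rule " + firstMatch(rules, url)
                + ", accepted=" + accepts(rules, url));
        }
    }
}
```

In this sketch file:///C:/Policies/ is decided by the +^(file|smb): rule
(index 1), so the more specific rule after it is never consulted. The same
rule also accepts file:///C:/, so the filter as written does not keep the
crawl inside C:/Policies/.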

---SNIP ---
2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL: 'smb://sql1/Sales/DATA/'
// Did you quote the URL yourself, or is it displayed like this in the logs?
// I don't get this error.
---SNIP ---

Try this in the class org.apache.nutch.crawl.Crawl:

  public static void main(String args[]) throws Exception {
    // new: tell java.net.URL to look for protocol handlers in the jcifs package
    System.setProperty("java.protocol.handler.pkgs", "jcifs");
    // new: log the property to confirm that it is set
    LOG.info("SMB Info: " + System.getProperty("java.protocol.handler.pkgs"));
    // new: log the permission that covers reading and writing this property
    LOG.info("SMB Info: "
        + new java.util.PropertyPermission("java.protocol.handler.pkgs", "read,write").toString());
    if (args.length < 1) {
      System.out.println
        ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
      return;
    }
---SNIP---
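
To check outside of Nutch whether the JVM can resolve an smb:// URL at all,
a small stand-alone probe like the following may help. This is only a
sketch: the class name UrlProtocolProbe is hypothetical, and it assumes
that, when run with -Djava.protocol.handler.pkgs=jcifs and the jcifs jar on
the classpath, java.net.URL will find a handler for smb. Without that, the
smb URL is expected to be rejected with a MalformedURLException.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlProtocolProbe {

    /**
     * Returns null if java.net.URL finds a protocol handler for the given
     * string, otherwise the message of the MalformedURLException.
     */
    @SuppressWarnings("deprecation") // new URL(String) is deprecated in recent JDKs but still available
    static String probe(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Shows which handler packages the JVM was started with (null if none were set).
        System.out.println("java.protocol.handler.pkgs = "
            + System.getProperty("java.protocol.handler.pkgs"));

        String[] specs = {"file:///C:/Policies/", "smb://mobidick/test/"};
        for (String spec : specs) {
            String error = probe(spec);
            System.out.println(spec + " -> " + (error == null ? "handler found" : error));
        }
    }
}
```

If the smb URL still fails here with the property set, the problem is in
the JVM setup (property or classpath) rather than in Nutch or plugin-smb.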

For background on Java protocol handlers, check this out:
http://java.sun.com/developer/onlineTraining/protocolhandlers/





opoole wrote:
> 
> Hi All, I hope you can help as I am becoming rather depressed with Nutch
> on Windows.
> 
> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
> 
> I cannot stop Nutch from crawling parent directories. I have looked at
> other threads, but none of the suggestions seem to work.
> 
> I have tried to include protocol-smb [jcifs] but Cygwin keeps prompting
> for Java syntax corrections.
> 
> Below I have listed my configurations along with the command I type in
> cygwin for jcifs:
> 
> CRAWL-URLFILTER
> # The url filter file used by the crawl command.
> 
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
> 
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
> 
> # skip http:, ftp:, & mailto: urls; accept file: and smb: urls
> -^(http|ftp|mailto):
> +^(file|smb):
> 
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> 
> # skip URLs containing certain characters as probable queries, etc.
> 
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> 
> # accept hosts in MY.DOMAIN.NAME
> # Standard: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^file:///C:/Policies/ <<-- Why did you put this here? It has no effect,
> because the +^(file|smb): line above already matches!
> 
> # skip everything else
> -.
> 
> NUTCH-SITE
> 
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> 
> <nutch-conf>
> 
> <property>
>  <name>http.agent.name</name>
>  <value>pascall</value>
>  <description></description>
> </property>
> 
> <property>
>   <name>file.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be
> truncated;
>   otherwise, no truncation at all.
>   </description>
> </property>
> 
> <property>
>   <name>file.crawl.parent</name>
>   <value>false</value>
>   <description>The crawler is not restricted to the directories that you
>     specified in the urls file; it jumps into the parent directories as
>     well. For your own crawls you can change this behavior (set to false)
>     so that only directories beneath the ones you specify get
>     crawled.</description>
> </property>
> 
> <property>
> <name>plugin.includes</name> 
> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
> </property> 
> 
> </nutch-conf>
> 
> CYGWIN
> 
> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
> 
> java -Djava.protocol.handler.pkgs=jcifs
> 
> When I press return the cygwin shell displays a list of java commands as
> though I am using incorrect syntax.
> 
> Dump of Crawl from Cygwin:
> 
> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - crawl started in: crawl
> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - rootUrlDir = urls.txt
> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - threads = 10
> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - depth = 5
> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: starting
> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: crawlDb:
> crawl/crawldb
> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: urlDir: urls.txt
> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: Converting
> injected urls to crawl db entries.
> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:16,953 INFO  plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Pdf Parse 
> Plug-in
> (parse-pdf)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Site Query 
> Filter
> (query-site)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Log4j 
> (lib-log4j)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       URL Query Filter
> (query-url)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       HTML Parse 
> Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -       Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:17,875 INFO  crawl.Injector - Injector: Merging injected
> urls into crawl db.
> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:18,375 WARN  util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes
> where applicable
> 2007-05-24 14:04:19,281 INFO  crawl.Injector - Injector: done
> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: Selecting
> best-scoring urls due for fetch.
> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: starting
> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: segment:
> crawl/segments/20070524140420
> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: filtering:
> false
> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: topN:
> 2147483647
> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:20,312 INFO  crawl.Generator - Generator: jobtracker is
> 'local', generating exactly one partition.
> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:20,609 INFO  plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Pdf Parse 
> Plug-in
> (parse-pdf)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Site Query 
> Filter
> (query-site)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Log4j 
> (lib-log4j)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       URL Query Filter
> (query-url)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       HTML Parse 
> Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -       Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:20,796 WARN  crawl.PartitionUrlByHost - Malformed URL:
> 'smb://sql1/Sales/DATA/'
> 2007-05-24 14:04:20,843 INFO  plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Pdf Parse 
> Plug-in
> (parse-pdf)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Site Query 
> Filter
> (query-site)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Log4j 
> (lib-log4j)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       URL Query Filter
> (query-url)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       HTML Parse 
> Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -       Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:21,578 INFO  crawl.Generator - Generator: Partitioning
> selected urls by host, for politeness.
> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:21,859 INFO  plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Pdf Parse 
> Plug-in
> (parse-pdf)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Site Query 
> Filter
> (query-site)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Log4j 
> (lib-log4j)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       URL Query Filter
> (query-url)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       HTML Parse 
> Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -       Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
> 'smb://sql1/Sales/DATA/'
> 2007-05-24 14:04:22,843 INFO  crawl.Generator - Generator: done.
> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: starting
> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: segment:
> crawl/segments/20070524140420
> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:23,187 INFO  fetcher.Fetcher - Fetcher: threads: 10
> 2007-05-24 14:04:23,203 INFO  plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Pdf Parse 
> Plug-in
> (parse-pdf)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Site Query 
> Filter
> (query-site)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Log4j 
> (lib-log4j)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       URL Query Filter
> (query-url)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       HTML Parse 
> Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -       Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetching
> smb://sql1/Sales/DATA/
> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetch of
> smb://sql1/Sales/DATA/ failed with:
> org.apache.nutch.protocol.ProtocolNotFound:
> java.net.MalformedURLException: unknown protocol: smb
> 2007-05-24 14:04:23,500 INFO  fetcher.Fetcher - fetching
> file:///C:/Policies/
> 2007-05-24 14:04:23,718 INFO  crawl.SignatureFactory - Using Signature
> impl: org.apache.nutch.crawl.MD5Signature
> 2007-05-24 14:04:24,671 INFO  plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Pdf Parse 
> Plug-in
> (parse-pdf)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Site Query 
> Filter
> (query-site)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Log4j 
> (lib-log4j)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       URL Query Filter
> (query-url)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       HTML Parse 
> Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -       Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:25,171 INFO  fetcher.Fetcher - Fetcher: done
> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: starting
> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: db:
> crawl/crawldb
> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: segments:
> [crawl/segments/20070524140420]
> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: additions
> allowed: true
> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
> normalizing: true
> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
> filtering: true
> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:25,203 INFO  crawl.CrawlDb - CrawlDb update: Merging
> segment data into db.
> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:25,468 INFO  plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:25,593 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 
> 
> Thank you for reading my post, hope you can help.
> 
> Regards,
> 
> Oli
> 

