Hi,

Thanks for your help with this. Someone emailed me to say that this is
fixed by a newer version of the jcifs implementation:

https://issues.apache.org/jira/browse/NUTCH-427

Give it a go and let me know if it works ;)


Vadim B wrote:
> 
> Hi,
> 
> I am working on the same issue as you. So far I can crawl file:///C:/*,
> but I am stuck on the smb part. It looks to me as though this plugin isn't
> working properly, so it needs to be fixed for the newer version of Nutch.
> 
> The error I get differs a bit from yours. It is:
> 
> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetching smb://mobidick/test/
> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetch of smb://mobidick/test/ failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb
> 
> I will dive into plugin-smb and try to narrow down the problem. Maybe we
> can work together to get a quick solution.
> 
> 
> 
> ---SNIP---
> 
> # accept hosts in MY.DOMAIN.NAME
> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^file:///C:/Policies/ <<-- Why did you put this here? It doesn't make
> sense, because the +^(file|smb): line above already matches, so this line
> is never reached.
> ---SNIP ---
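To make the "first matching pattern wins" point concrete, here is a minimal sketch of a first-match filter applied to the rules from the posted crawl-urlfilter. It is not Nutch's own filter code; the class and method names (`FirstMatchFilterSketch`, `firstMatch`, `accepts`) are made up for illustration, and treating a rule as matching when the regex is found anywhere in the URL is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of "first matching pattern wins"; not Nutch code.
class FirstMatchFilterSketch {
    /** One rule: a sign ('+' accept, '-' reject) followed by a regular expression. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    /** Parses rule lines, skipping blank lines and '#' comments. */
    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            rules.add(new Rule(trimmed));
        }
        return rules;
    }

    /** Returns the index of the first rule whose regex is found in the URL, or -1. */
    static int firstMatch(List<Rule> rules, String url) {
        for (int i = 0; i < rules.size(); i++) {
            if (rules.get(i).pattern.matcher(url).find()) {
                return i;
            }
        }
        return -1;
    }

    /** A URL is accepted only if its first matching rule is a '+' rule. */
    static boolean accepts(List<Rule> rules, String url) {
        int index = firstMatch(rules, url);
        return index >= 0 && rules.get(index).accept;
    }

    public static void main(String[] args) {
        // Same order as in the posted crawl-urlfilter (suffix and loop rules left out).
        List<Rule> rules = parse(
            "-^(http|ftp|mailto):",
            "+^(file|smb):",
            "+^file:///C:/Policies/",
            "-.");
        String[] urls = {
            "file:///C:/Policies/", "file:///C:/", "smb://sql1/Sales/DATA/", "http://example.com/"
        };
        for (String url : urls) {
            System.out.println(url + " -> rule " + firstMatch(rules, url)
                + (accepts(rules, url) ? " (accepted)" : " (rejected)"));
        }
    }
}
```

With the rules in this order, both file:///C:/Policies/ and its parent file:///C:/ are decided by rule 1 (counting from 0), the +^(file|smb): line, so the Policies line never gets a chance to narrow anything. If the goal is to restrict the crawl to that folder, the broader file rule would probably have to go or be placed after the specific one.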
> 
> ---SNIP ---
> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
> 'smb://sql1/Sales/DATA/' 
> // Did you quote the URL yourself, or is it displayed like this in the
> // logs? I don't get this error.
> ---SNIP ---
> 
> Try this in the class org.apache.nutch.crawl.Crawl:
> 
>   public static void main(String args[]) throws Exception {
>     // new: register jcifs as a URL protocol handler package
>     System.setProperty("java.protocol.handler.pkgs", "jcifs");
>     // new: log the value to confirm that it is set
>     LOG.info("SMB Info: "
>         + System.getProperty("java.protocol.handler.pkgs"));
>     // new: log the permission that covers reading and writing the property
>     LOG.info("SMB Info: " + new java.util.PropertyPermission(
>         "java.protocol.handler.pkgs", "read,write"));
>     if (args.length < 1) {
>       System.out.println
>         ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
>       return;
>     }
> ---SNIP---
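As background to the snippet above, here is a small self-contained sketch of why the JVM reports "unknown protocol: smb". It does not use jcifs. `SmbUrlSketch` and `StubSmbHandler` are hypothetical names, and the stub only lets the URL parse, it cannot open a connection. Recent JDKs mark the `java.net.URL` constructors as deprecated, so expect a compiler warning there.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

// Hypothetical illustration; jcifs would supply a real handler for smb: URLs.
class SmbUrlSketch {
    /** Placeholder handler: lets smb: URLs parse, but cannot open them. */
    static final class StubSmbHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL url) throws IOException {
            throw new IOException("stub handler: no SMB client attached");
        }
    }

    /** True if the JVM's default handler lookup can construct this URL. */
    static boolean parses(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            // For an unregistered scheme the message is "unknown protocol: <scheme>".
            return false;
        }
    }

    /** Parses the URL with an explicitly supplied handler, bypassing the lookup. */
    static URL parseWith(String spec, URLStreamHandler handler) {
        try {
            return new URL(null, spec, handler);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("cannot parse " + spec, e);
        }
    }

    public static void main(String[] args) {
        String spec = "smb://sql1/Sales/DATA/";
        System.out.println("default lookup parses smb: " + parses(spec));
        URL url = parseWith(spec, new StubSmbHandler());
        System.out.println("explicit handler: host=" + url.getHost()
            + " path=" + url.getPath());
    }
}
```

Setting java.protocol.handler.pkgs is one way to make the default lookup find a handler class for a scheme. It only helps if the handler classes (here, the jcifs jar) are visible to the class loader that performs the lookup, which may be the part that goes wrong inside the plugin.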
> 
> check out this:
> http://java.sun.com/developer/onlineTraining/protocolhandlers/
> 
> 
> 
> 
> 
> opoole wrote:
>> 
>> Hi All, I hope you can help, as I am becoming rather depressed with Nutch
>> on Windows.
>> 
>> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
>> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
>> 
>> I cannot stop Nutch from crawling parent directories. I have looked at
>> other threads, and none of the suggestions seem to work.
>> 
>> I have tried to include protocol-smb [jcifs] but Cygwin keeps prompting
>> for Java syntax corrections.
>> 
>> Below I have listed my configurations along with the command I type in
>> cygwin for jcifs:
>> 
>> CRAWL-URLFILTER
>> # The url filter file used by the crawl command.
>> 
>> # Better for intranet crawling.
>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>> 
>> # Each non-comment, non-blank line contains a regular expression
>> # prefixed by '+' or '-'.  The first matching pattern in the file
>> # determines whether a URL is included or ignored.  If no pattern
>> # matches, the URL is ignored.
>> 
>> # skip http:, ftp:, & mailto: urls; accept file: and smb: urls
>> -^(http|ftp|mailto):
>> +^(file|smb):
>> 
>> # skip image and other suffixes we can't yet parse
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>> 
>> # skip URLs containing certain characters as probable queries, etc.
>> 
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>> 
>> # accept hosts in MY.DOMAIN.NAME
>> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>> +^file:///C:/Policies/ <<-- Why did you put this here? It doesn't make
>> sense, because the +^(file|smb): line above already matches!
>> 
>> # skip everything else
>> -.
>> 
>> NUTCH-SITE
>> 
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>> 
>> <nutch-conf>
>> 
>> <property>
>>  <name>http.agent.name</name>
>>  <value>pascall</value>
>>  <description></description>
>> </property>
>> 
>> <property>
>>   <name>file.content.limit</name>
>>   <value>-1</value>
>>   <description>The length limit for downloaded content, in bytes.
>>   If this value is nonnegative (>=0), content longer than it will be
>> truncated;
>>   otherwise, no truncation at all.
>>   </description>
>> </property>
>> 
>> <property>
>>   <name>file.crawl.parent</name>
>>   <value>false</value>
>>   <description>The crawler is not restricted to the directories that you
>>     specified in the URLs file; it jumps into the parent directories as
>>     well. For your own crawls you can change this behavior (set to false)
>>     so that only directories beneath the ones you specify get
>>     crawled.</description>
>> </property>
>> 
>> <property>
>> <name>plugin.includes</name> 
>> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
>> </property> 
>> 
>> </nutch-conf>
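The crawl dump further down repeats "bad conf file: top-level element not <configuration>", while the file posted here uses <nutch-conf> as its root element. The sketch below shows the kind of check that message implies. It is not the actual Configuration code, and `ConfRootCheckSketch` is a made-up name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical illustration of a root-element check; not the real loader.
class ConfRootCheckSketch {
    /** Returns the tag name of the document's root element. */
    static String rootElementName(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            throw new IllegalArgumentException("not parseable as XML", e);
        }
    }

    /** True if the root element is named "configuration". */
    static boolean hasConfigurationRoot(String xml) {
        return "configuration".equals(rootElementName(xml));
    }

    public static void main(String[] args) {
        String body = "<property><name>file.crawl.parent</name>"
            + "<value>false</value></property>";
        String posted = "<nutch-conf>" + body + "</nutch-conf>";
        String renamed = "<configuration>" + body + "</configuration>";
        System.out.println("nutch-conf root accepted: " + hasConfigurationRoot(posted));
        System.out.println("configuration root accepted: " + hasConfigurationRoot(renamed));
    }
}
```

Renaming the root element to <configuration> (opening and closing tag) is the likely way to get rid of the FATAL lines. The plugin list in the same dump does include protocol-smb, so the properties appear to have been read anyway; the rename alone may therefore not change the crawl behaviour.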
>> 
>> CYGWIN
>> 
>> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
>> 
>> java -Djava.protocol.handler.pkgs=jcifs
>> 
>> When I press return the cygwin shell displays a list of java commands as
>> though I am using incorrect syntax.
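On the usage listing: the java launcher expects a class name (or -jar and a jar file) after its options, so a command that consists only of a -D option makes it print its usage text. A -D property also applies only to the JVM started by that one command, so it would not carry over to a later Nutch run. A tiny sketch for checking that the property arrives, with `ShowHandlerPkgs` as a made-up class name:

```java
// Hypothetical helper class: prints the handler package property of this JVM.
class ShowHandlerPkgs {
    /** Formats the property value, or says that it is missing. */
    static String describe(String value) {
        return value == null
            ? "java.protocol.handler.pkgs is not set"
            : "java.protocol.handler.pkgs = " + value;
    }

    public static void main(String[] args) {
        System.out.println(describe(System.getProperty("java.protocol.handler.pkgs")));
    }
}
```

After compiling it, running java -Djava.protocol.handler.pkgs=jcifs ShowHandlerPkgs from the directory that holds the class file should report the value. For Nutch itself the option has to reach the JVM that runs the crawl.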
>> 
>> Dump of Crawl from Cygwin:
>> 
>> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - crawl started in: crawl
>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - rootUrlDir = urls.txt
>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - threads = 10
>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - depth = 5
>> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: starting
>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: crawlDb:
>> crawl/crawldb
>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: urlDir: urls.txt
>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: Converting
>> injected urls to crawl db entries.
>> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,953 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -      Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:17,875 INFO  crawl.Injector - Injector: Merging injected
>> urls into crawl db.
>> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:18,375 WARN  util.NativeCodeLoader - Unable to load
>> native-hadoop library for your platform... using builtin-java classes
>> where applicable
>> 2007-05-24 14:04:19,281 INFO  crawl.Injector - Injector: done
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: Selecting
>> best-scoring urls due for fetch.
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: starting
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: segment:
>> crawl/segments/20070524140420
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: filtering:
>> false
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: topN:
>> 2147483647
>> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:20,312 INFO  crawl.Generator - Generator: jobtracker is
>> 'local', generating exactly one partition.
>> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:20,609 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -      Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:20,796 WARN  crawl.PartitionUrlByHost - Malformed URL:
>> 'smb://sql1/Sales/DATA/'
>> 2007-05-24 14:04:20,843 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -      Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:21,578 INFO  crawl.Generator - Generator: Partitioning
>> selected urls by host, for politeness.
>> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:21,859 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -      Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
>> 'smb://sql1/Sales/DATA/'
>> 2007-05-24 14:04:22,843 INFO  crawl.Generator - Generator: done.
>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: starting
>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: segment:
>> crawl/segments/20070524140420
>> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:23,187 INFO  fetcher.Fetcher - Fetcher: threads: 10
>> 2007-05-24 14:04:23,203 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -      Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetching
>> smb://sql1/Sales/DATA/
>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetch of
>> smb://sql1/Sales/DATA/ failed with:
>> org.apache.nutch.protocol.ProtocolNotFound:
>> java.net.MalformedURLException: unknown protocol: smb
>> 2007-05-24 14:04:23,500 INFO  fetcher.Fetcher - fetching
>> file:///C:/Policies/
>> 2007-05-24 14:04:23,718 INFO  crawl.SignatureFactory - Using Signature
>> impl: org.apache.nutch.crawl.MD5Signature
>> 2007-05-24 14:04:24,671 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -      Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:25,171 INFO  fetcher.Fetcher - Fetcher: done
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: starting
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: db:
>> crawl/crawldb
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: segments:
>> [crawl/segments/20070524140420]
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: additions
>> allowed: true
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>> normalizing: true
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>> filtering: true
>> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:25,203 INFO  crawl.CrawlDb - CrawlDb update: Merging
>> segment data into db.
>> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:25,468 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:25,593 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 
>> 
>> Thank you for reading my post, hope you can help.
>> 
>> Regards,
>> 
>> Oli
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/WIN-XP-PRO--Djava.protocol*-file%3A---c%3A-folder--Crawling-Parents-tf3809966.html#a10851108
Sent from the Nutch - User mailing list archive at Nabble.com.

