Were you able to solve the problem?

I get a transfer speed of about 800 kb/s, which is not fast enough for use
in a production environment. What about you?



opoole wrote:
> 
> Sorry Vadim,
> 
> I did not realise you had sent me the email [Doh!].
> 
> 
> Vadim B wrote:
>> 
>> Hi,
>> 
>> I am working on the same issue as you. So far I can crawl file:///C:/*,
>> but I am stuck on the smb part. It looks to me as if this plugin isn't
>> working properly, so it needs to be fixed for the newer version of Nutch.
>> 
>> The error I get differs a bit from yours:
>> 
>> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetching
>> smb://mobidick/test/
>> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetch of
>> smb://mobidick/test/ failed with:
>> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for
>> url=smb
>> 
>> I will dive into plugin-smb and try to narrow down the problem. Maybe
>> we can work together to get a quick solution.
>> 
>> 
>> 
>> ---SNIP---
>> 
>> # accept hosts in MY.DOMAIN.NAME
>> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>> +^file:///C:/Policies/ <<-- why did you put this here? It has no effect,
>> because the +^(file|smb): line above already matches, so this line is
>> never reached.
>> ---SNIP ---
>> 
>> ---SNIP ---
>> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
>> 'smb://sql1/Sales/DATA/' 
>> // Did you quote the URL yourself, or is it displayed like this in the
>> // logs? I don't get this error.
>> ---SNIP ---
>> 
>> Try this in the class org.apache.nutch.crawl.Crawl:
>> 
>>   public static void main(String args[]) throws Exception {
>>     System.setProperty("java.protocol.handler.pkgs", "jcifs"); // new
>>     LOG.info("SMB Info: "
>>         + System.getProperty("java.protocol.handler.pkgs")); // new
>>     LOG.info("SMB Info: "
>>         + new java.util.PropertyPermission(
>>             "java.protocol.handler.pkgs", "read, write").toString()); // new
>>     if (args.length < 1) {
>>       System.out.println
>>         ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
>>       return;
>>     }
>> ---SNIP---
>> 
>> Check this out:
>> http://java.sun.com/developer/onlineTraining/protocolhandlers/
>> 
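
On the "unknown protocol: smb" failure: as far as I understand it, java.net.URL rejects any scheme for which the JVM cannot find a stream handler, which would explain why the fetch fails before the plugin gets to do any work. Below is a minimal sketch of that behaviour. It is not Nutch or jCIFS code; SmbUrlCheck and StubSmbHandler are hypothetical names, and the stub only lets the URL parse, it cannot talk SMB.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

class SmbUrlCheck {

    // Hypothetical stand-in for a real SMB handler (for example the one
    // shipped with jCIFS). It allows parsing but refuses to connect.
    static class StubSmbHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("stub handler: no SMB support here");
        }
    }

    // Returns true if a URL object can be built from spec.
    // A null handler means "use the JVM's normal handler lookup".
    static boolean parses(String spec, URLStreamHandler handler) {
        try {
            new URL(null, spec, handler);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String spec = "smb://sql1/Sales/DATA/";
        System.out.println("default lookup:   " + parses(spec, null));
        System.out.println("explicit handler: " + parses(spec, new StubSmbHandler()));
    }
}
```

With jCIFS on the classpath, setting java.protocol.handler.pkgs=jcifs before the first smb URL is created should let the default lookup succeed as well, but I have not verified that against Nutch 0.9.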
>> 
>> 
>> 
>> 
>> opoole wrote:
>>> 
>>> Hi all, I hope you can help, as I am becoming rather depressed with
>>> Nutch on Windows.
>>> 
>>> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
>>> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
>>> 
>>> I cannot stop Nutch from crawling parent directories. I have looked at
>>> other threads, and none of the suggestions seem to work.
>>> 
>>> I have tried to include protocol-smb [jcifs] but Cygwin keeps prompting
>>> for Java syntax corrections.
>>> 
>>> Below I have listed my configurations along with the command I type in
>>> cygwin for jcifs:
>>> 
>>> CRAWL-URLFILTER
>>> # The url filter file used by the crawl command.
>>> 
>>> # Better for intranet crawling.
>>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>>> 
>>> # Each non-comment, non-blank line contains a regular expression
>>> # prefixed by '+' or '-'.  The first matching pattern in the file
>>> # determines whether a URL is included or ignored.  If no pattern
>>> # matches, the URL is ignored.
>>> 
>>> # skip http:, ftp:, & mailto: urls; accept file: and smb: urls
>>> -^(http|ftp|mailto):
>>> +^(file|smb):
>>> 
>>> # skip image and other suffixes we can't yet parse
>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>> 
>>> # skip URLs containing certain characters as probable queries, etc.
>>> 
>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>> 
>>> # accept hosts in MY.DOMAIN.NAME
>>> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>> +^file:///C:/Policies/ <<-- why did you put this here? It has no effect,
>>> because the +^(file|smb): line above already matches!
>>> 
>>> # skip everything else
>>> -.
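
The first-match rule described in the comments of this file can be sketched in plain Java. This is an illustration only, not Nutch's filter implementation: FirstMatchFilter is a hypothetical class, the image-suffix rule is shortened, and I am assuming each pattern is searched anywhere in the URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class FirstMatchFilter {

    // One rule line: '+' accepts, '-' rejects, the rest is a regex.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    FirstMatchFilter(String... lines) {
        for (String line : lines) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("#")) {
                continue; // blank lines and comments carry no rule
            }
            char sign = t.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("rule must start with + or -: " + t);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(t.substring(1))));
        }
    }

    // Index of the first rule whose pattern is found in the URL, or -1.
    int firstMatch(String url) {
        for (int i = 0; i < rules.size(); i++) {
            if (rules.get(i).pattern.matcher(url).find()) {
                return i;
            }
        }
        return -1;
    }

    // The first matching rule decides; no match means the URL is ignored.
    boolean accepts(String url) {
        int i = firstMatch(url);
        return i >= 0 && rules.get(i).accept;
    }

    public static void main(String[] args) {
        FirstMatchFilter posted = new FirstMatchFilter(
            "-^(http|ftp|mailto):",
            "+^(file|smb):",
            "-\\.(gif|jpg|png)$",
            "+^file:///C:/Policies/",
            "-.");
        for (String url : new String[] {"file:///C:/", "file:///C:/Policies/a.doc"}) {
            System.out.println(url + " -> rule " + posted.firstMatch(url)
                + ", accepted=" + posted.accepts(url));
        }
    }
}
```

If the aim is to keep the crawl inside C:/Policies/, one option might be to drop "file" from the broad accept rule so that the narrower +^file:///C:/Policies/ line is actually reached and everything else falls through to "-.".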
>>> 
>>> NUTCH-SITE
>>> 
>>> <?xml version="1.0"?>
>>> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
>>> <!-- Put site-specific property overrides in this file. -->
>>> 
>>> <nutch-conf>
>>> 
>>> <property>
>>>  <name>http.agent.name</name>
>>>  <value>pascall</value>
>>>  <description></description>
>>> </property>
>>> 
>>> <property>
>>>   <name>file.content.limit</name>
>>>   <value>-1</value>
>>>   <description>The length limit for downloaded content, in bytes.
>>>   If this value is nonnegative (>=0), content longer than it will be
>>> truncated;
>>>   otherwise, no truncation at all.
>>>   </description>
>>> </property>
>>> 
>>> <property>
>>>   <name>file.crawl.parent</name>
>>>   <value>false</value>
>>>   <description>The crawler is not restricted to the directories that you
>>>     specified in the urls file; it also jumps into the parent directories.
>>>     For your own crawls you can change this behavior (set to false) so that
>>>     only directories beneath the ones you specify get crawled.</description>
>>> </property>
>>> 
>>> <property>
>>> <name>plugin.includes</name> 
>>> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
>>> </property> 
>>> 
>>> </nutch-conf>
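
The repeated FATAL line "top-level element not <configuration>" in the crawl log suggests that the loader wants <configuration> as the root element, whereas this file uses <nutch-conf>. If so, the overrides here (including file.crawl.parent) may never be applied. The sketch below checks a file's root element with the JDK's DOM parser; ConfRootCheck is a hypothetical helper and not the loader that Nutch actually uses.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class ConfRootCheck {

    // Parses the XML text and returns the name of its top-level element.
    static String rootElement(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // True if the document root is <configuration>, as the log message asks for.
    static boolean hasConfigurationRoot(String xml) {
        return "configuration".equals(rootElement(xml));
    }

    public static void main(String[] args) {
        String property = "<property><name>file.crawl.parent</name>"
            + "<value>false</value></property>";
        String posted = "<?xml version=\"1.0\"?><nutch-conf>" + property + "</nutch-conf>";
        String renamed = "<?xml version=\"1.0\"?><configuration>" + property + "</configuration>";
        System.out.println("posted root ok:  " + hasConfigurationRoot(posted));
        System.out.println("renamed root ok: " + hasConfigurationRoot(renamed));
    }
}
```

Renaming the opening and closing <nutch-conf> tags to <configuration> and re-running the crawl would show whether the FATAL lines go away.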
>>> 
>>> CYGWIN
>>> 
>>> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
>>> 
>>> java -Djava.protocol.handler.pkgs=jcifs
>>> 
>>> When I press return the cygwin shell displays a list of java commands as
>>> though I am using incorrect syntax.
>>> 
>>> Dump of Crawl from Cygwin:
>>> 
>>> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - crawl started in: crawl
>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - rootUrlDir = urls.txt
>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - threads = 10
>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - depth = 5
>>> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: starting
>>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: crawlDb:
>>> crawl/crawldb
>>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: urlDir:
>>> urls.txt
>>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: Converting
>>> injected urls to crawl db entries.
>>> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:16,953 INFO  plugin.PluginRepository - Plugins: looking
>>> in: C:\nutch-0.9\plugins
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>>> Plugins:
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     MSPowerPoint
>>> Parse Plug-in (parse-mspowerpoint)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Basic Query
>>> Filter (query-basic)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Basic Indexing
>>> Filter (index-basic)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Html Parse
>>> Plug-in (parse-html)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Pdf Parse
>>> Plug-in (parse-pdf)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Site Query
>>> Filter (query-site)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Jakarta POI -
>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Text Parse
>>> Plug-in (parse-text)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     MSWord Parse
>>> Plug-in (parse-msword)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     SMB Protocol
>>> Plug-in (protocol-smb)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     MSExcel Parse
>>> Plug-in (parse-msexcel)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     OPIC Scoring
>>> Plug-in (scoring-opic)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     CyberNeko HTML
>>> Parser (lib-nekohtml)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Log4j
>>> (lib-log4j)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     File Protocol
>>> Plug-in (protocol-file)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     URL Query
>>> Filter (query-url)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Parse MS
>>> Documents Framework (lib-parsems)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>>> Extension-Points:
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Nutch
>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Nutch URL
>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Nutch URL
>>> Filter (org.apache.nutch.net.URLFilter)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Nutch Indexing
>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Nutch Online
>>> Search Results Clustering Plugin
>>> (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     HTML Parse
>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Nutch Content
>>> Parser (org.apache.nutch.parse.Parser)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Nutch Query
>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -     Ontology Model
>>> Loader (org.apache.nutch.ontology.Ontology)
>>> 2007-05-24 14:04:17,875 INFO  crawl.Injector - Injector: Merging
>>> injected urls into crawl db.
>>> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:18,375 WARN  util.NativeCodeLoader - Unable to load
>>> native-hadoop library for your platform... using builtin-java classes
>>> where applicable
>>> 2007-05-24 14:04:19,281 INFO  crawl.Injector - Injector: done
>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: Selecting
>>> best-scoring urls due for fetch.
>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: starting
>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: segment:
>>> crawl/segments/20070524140420
>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: filtering:
>>> false
>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: topN:
>>> 2147483647
>>> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:20,312 INFO  crawl.Generator - Generator: jobtracker is
>>> 'local', generating exactly one partition.
>>> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:20,609 INFO  plugin.PluginRepository - Plugins: looking
>>> in: C:\nutch-0.9\plugins
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>>> Plugins:
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     MSPowerPoint
>>> Parse Plug-in (parse-mspowerpoint)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Basic Query
>>> Filter (query-basic)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Basic Indexing
>>> Filter (index-basic)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Html Parse
>>> Plug-in (parse-html)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Pdf Parse
>>> Plug-in (parse-pdf)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Site Query
>>> Filter (query-site)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Jakarta POI -
>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Text Parse
>>> Plug-in (parse-text)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     MSWord Parse
>>> Plug-in (parse-msword)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     SMB Protocol
>>> Plug-in (protocol-smb)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     MSExcel Parse
>>> Plug-in (parse-msexcel)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     OPIC Scoring
>>> Plug-in (scoring-opic)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     CyberNeko HTML
>>> Parser (lib-nekohtml)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Log4j
>>> (lib-log4j)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     File Protocol
>>> Plug-in (protocol-file)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     URL Query
>>> Filter (query-url)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Parse MS
>>> Documents Framework (lib-parsems)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>>> Extension-Points:
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Nutch
>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Nutch URL
>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Nutch URL
>>> Filter (org.apache.nutch.net.URLFilter)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Nutch Indexing
>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Nutch Online
>>> Search Results Clustering Plugin
>>> (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     HTML Parse
>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Nutch Content
>>> Parser (org.apache.nutch.parse.Parser)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Nutch Query
>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -     Ontology Model
>>> Loader (org.apache.nutch.ontology.Ontology)
>>> 2007-05-24 14:04:20,796 WARN  crawl.PartitionUrlByHost - Malformed URL:
>>> 'smb://sql1/Sales/DATA/'
>>> 2007-05-24 14:04:20,843 INFO  plugin.PluginRepository - Plugins: looking
>>> in: C:\nutch-0.9\plugins
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>>> Plugins:
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     MSPowerPoint
>>> Parse Plug-in (parse-mspowerpoint)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Basic Query
>>> Filter (query-basic)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Basic Indexing
>>> Filter (index-basic)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Html Parse
>>> Plug-in (parse-html)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Pdf Parse
>>> Plug-in (parse-pdf)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Site Query
>>> Filter (query-site)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Jakarta POI -
>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Text Parse
>>> Plug-in (parse-text)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     MSWord Parse
>>> Plug-in (parse-msword)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     SMB Protocol
>>> Plug-in (protocol-smb)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     MSExcel Parse
>>> Plug-in (parse-msexcel)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     OPIC Scoring
>>> Plug-in (scoring-opic)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     CyberNeko HTML
>>> Parser (lib-nekohtml)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Log4j
>>> (lib-log4j)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     File Protocol
>>> Plug-in (protocol-file)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     URL Query
>>> Filter (query-url)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Parse MS
>>> Documents Framework (lib-parsems)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>>> Extension-Points:
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Nutch
>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Nutch URL
>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Nutch URL
>>> Filter (org.apache.nutch.net.URLFilter)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Nutch Indexing
>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Nutch Online
>>> Search Results Clustering Plugin
>>> (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     HTML Parse
>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Nutch Content
>>> Parser (org.apache.nutch.parse.Parser)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Nutch Query
>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -     Ontology Model
>>> Loader (org.apache.nutch.ontology.Ontology)
>>> 2007-05-24 14:04:21,578 INFO  crawl.Generator - Generator: Partitioning
>>> selected urls by host, for politeness.
>>> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:21,859 INFO  plugin.PluginRepository - Plugins: looking
>>> in: C:\nutch-0.9\plugins
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>>> Plugins:
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     MSPowerPoint
>>> Parse Plug-in (parse-mspowerpoint)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Basic Query
>>> Filter (query-basic)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Basic Indexing
>>> Filter (index-basic)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Html Parse
>>> Plug-in (parse-html)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Pdf Parse
>>> Plug-in (parse-pdf)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Site Query
>>> Filter (query-site)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Jakarta POI -
>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Text Parse
>>> Plug-in (parse-text)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     MSWord Parse
>>> Plug-in (parse-msword)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     SMB Protocol
>>> Plug-in (protocol-smb)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     MSExcel Parse
>>> Plug-in (parse-msexcel)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     OPIC Scoring
>>> Plug-in (scoring-opic)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     CyberNeko HTML
>>> Parser (lib-nekohtml)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Log4j
>>> (lib-log4j)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     File Protocol
>>> Plug-in (protocol-file)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     URL Query
>>> Filter (query-url)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Parse MS
>>> Documents Framework (lib-parsems)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>>> Extension-Points:
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Nutch
>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Nutch URL
>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Nutch URL
>>> Filter (org.apache.nutch.net.URLFilter)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Nutch Indexing
>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Nutch Online
>>> Search Results Clustering Plugin
>>> (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     HTML Parse
>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Nutch Content
>>> Parser (org.apache.nutch.parse.Parser)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Nutch Query
>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -     Ontology Model
>>> Loader (org.apache.nutch.ontology.Ontology)
>>> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
>>> 'smb://sql1/Sales/DATA/'
>>> 2007-05-24 14:04:22,843 INFO  crawl.Generator - Generator: done.
>>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: starting
>>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: segment:
>>> crawl/segments/20070524140420
>>> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:23,187 INFO  fetcher.Fetcher - Fetcher: threads: 10
>>> 2007-05-24 14:04:23,203 INFO  plugin.PluginRepository - Plugins: looking
>>> in: C:\nutch-0.9\plugins
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>>> Plugins:
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     MSPowerPoint
>>> Parse Plug-in (parse-mspowerpoint)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Basic Query
>>> Filter (query-basic)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Basic Indexing
>>> Filter (index-basic)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Html Parse
>>> Plug-in (parse-html)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Pdf Parse
>>> Plug-in (parse-pdf)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Site Query
>>> Filter (query-site)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Jakarta POI -
>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Text Parse
>>> Plug-in (parse-text)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     MSWord Parse
>>> Plug-in (parse-msword)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     SMB Protocol
>>> Plug-in (protocol-smb)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     MSExcel Parse
>>> Plug-in (parse-msexcel)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     OPIC Scoring
>>> Plug-in (scoring-opic)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     CyberNeko HTML
>>> Parser (lib-nekohtml)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Log4j
>>> (lib-log4j)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     File Protocol
>>> Plug-in (protocol-file)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     URL Query
>>> Filter (query-url)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Parse MS
>>> Documents Framework (lib-parsems)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>>> Extension-Points:
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Nutch
>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Nutch URL
>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Nutch URL
>>> Filter (org.apache.nutch.net.URLFilter)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Nutch Indexing
>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Nutch Online
>>> Search Results Clustering Plugin
>>> (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     HTML Parse
>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Nutch Content
>>> Parser (org.apache.nutch.parse.Parser)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Nutch Query
>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -     Ontology Model
>>> Loader (org.apache.nutch.ontology.Ontology)
>>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetching
>>> smb://sql1/Sales/DATA/
>>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetch of
>>> smb://sql1/Sales/DATA/ failed with:
>>> org.apache.nutch.protocol.ProtocolNotFound:
>>> java.net.MalformedURLException: unknown protocol: smb
>>> 2007-05-24 14:04:23,500 INFO  fetcher.Fetcher - fetching
>>> file:///C:/Policies/
>>> 2007-05-24 14:04:23,718 INFO  crawl.SignatureFactory - Using Signature
>>> impl: org.apache.nutch.crawl.MD5Signature
>>> 2007-05-24 14:04:24,671 INFO  plugin.PluginRepository - Plugins: looking
>>> in: C:\nutch-0.9\plugins
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>>> Plugins:
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     MSPowerPoint
>>> Parse Plug-in (parse-mspowerpoint)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Basic Query
>>> Filter (query-basic)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Basic Indexing
>>> Filter (index-basic)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Html Parse
>>> Plug-in (parse-html)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Pdf Parse
>>> Plug-in (parse-pdf)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Site Query
>>> Filter (query-site)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Jakarta POI -
>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Text Parse
>>> Plug-in (parse-text)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     MSWord Parse
>>> Plug-in (parse-msword)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     SMB Protocol
>>> Plug-in (protocol-smb)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     MSExcel Parse
>>> Plug-in (parse-msexcel)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     OPIC Scoring
>>> Plug-in (scoring-opic)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     CyberNeko HTML
>>> Parser (lib-nekohtml)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Log4j
>>> (lib-log4j)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     File Protocol
>>> Plug-in (protocol-file)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     URL Query
>>> Filter (query-url)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Parse MS
>>> Documents Framework (lib-parsems)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>>> Extension-Points:
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Nutch
>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Nutch URL
>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Nutch URL
>>> Filter (org.apache.nutch.net.URLFilter)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Nutch Indexing
>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Nutch Online
>>> Search Results Clustering Plugin
>>> (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     HTML Parse
>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Nutch Content
>>> Parser (org.apache.nutch.parse.Parser)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Nutch Query
>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -     Ontology Model
>>> Loader (org.apache.nutch.ontology.Ontology)
>>> 2007-05-24 14:04:25,171 INFO  fetcher.Fetcher - Fetcher: done
>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: starting
>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: db:
>>> crawl/crawldb
>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: segments:
>>> [crawl/segments/20070524140420]
>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: additions
>>> allowed: true
>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>>> normalizing: true
>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>>> filtering: true
>>> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:25,203 INFO  crawl.CrawlDb - CrawlDb update: Merging
>>> segment data into db.
>>> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:25,468 INFO  plugin.PluginRepository - Plugins: looking
>>> in: C:\nutch-0.9\plugins
>>> 2007-05-24 14:04:25,593 INFO  plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 
>>> 
>>> Thank you for reading my post, hope you can help.
>>> 
>>> Regards,
>>> 
>>> Oli
>>> 
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/WIN-XP-PRO--Djava.protocol*-file%3A---c%3A-folder--Crawling-Parents-tf3809966.html#a10968398
Sent from the Nutch - User mailing list archive at Nabble.com.

