Hi Vadim,

To be honest, I am somewhat behind you: my problem is that I cannot get the
SMB protocol set up. I am unable to get the -Djava.protocol.handler.pkgs bit
to do anything; I am using Cygwin and entering the command from within
Sun\Java etc.
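
For reference, the -D option only sets a system property for the class named
after it on the command line; "java -Djava.protocol.handler.pkgs=jcifs" on its
own, with no class name, just prints the launcher's usage text. Below is a
minimal sketch of how one could check from Java whether a handler for a given
protocol is visible to the JVM. The class and method names are made up for
illustration (they are not part of Nutch or jcifs), and I am assuming the
jcifs jar must also be on the classpath for the property to have any effect.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SmbHandlerCheck {

    /**
     * Returns true if java.net.URL accepts the string, which requires
     * (among other things) a registered handler for its protocol.
     */
    public static boolean canParse(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            // Thrown for "unknown protocol: ..." as well as for missing schemes.
            return false;
        }
    }

    public static void main(String[] args) {
        // Show what the JVM was actually started with (null if -D was not passed).
        System.out.println("java.protocol.handler.pkgs = "
                + System.getProperty("java.protocol.handler.pkgs"));

        // Hypothetical test URLs taken from the logs in this thread.
        String[] specs = {"file:///C:/Policies/", "smb://sql1/Sales/DATA/"};
        for (String spec : specs) {
            System.out.println(spec + " -> "
                    + (canParse(spec) ? "handler found" : "not accepted"));
        }
    }
}
```

It could be run as "java -Djava.protocol.handler.pkgs=jcifs SmbHandlerCheck"
(plus whatever classpath is needed) to see whether the smb: line changes.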

As for crawl speed, I'd love to get that far.

Also, I noticed that you were crawling from the root of C:\, whereas I want to
crawl a specific folder, and that is where the parent directory issue crops
up: I cannot get it to stop crawling the parent. One thing I noticed is that
I did not have a URLFILTER entry in my nutch-config.xml, and that makes a
difference: if I try to set it up as in the tutorial, it won't crawl a
thing.
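
On the parent directory problem, the rule stated in the filter file ("the
first matching pattern in the file determines whether a URL is included or
ignored") can be modelled in a few lines. This is only a sketch of that
ordering rule, not Nutch's actual filter class: the name FirstMatchUrlFilter
is made up, and I am assuming patterns are applied as partial (find) matches.
It suggests why a broad +^(file|smb): line placed ahead of
+^file:///C:/Policies/ would let file:///C:/ through.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class FirstMatchUrlFilter {

    /** One filter line: '+' (accept) or '-' (reject) followed by a regex. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Builds the filter from lines in file order; blanks and '#' comments are skipped. */
    public FirstMatchUrlFilter(String... lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** The first rule whose pattern matches decides; no match means the URL is ignored. */
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Rule order as posted in this thread: the broad line comes first.
        FirstMatchUrlFilter broad = new FirstMatchUrlFilter(
                "-^(http|ftp|mailto):", "+^(file|smb):", "+^file:///C:/Policies/", "-.");
        // Same rules without the broad line.
        FirstMatchUrlFilter narrow = new FirstMatchUrlFilter(
                "-^(http|ftp|mailto):", "+^file:///C:/Policies/", "-.");

        System.out.println(broad.accepts("file:///C:/"));                   // true
        System.out.println(narrow.accepts("file:///C:/"));                  // false
        System.out.println(narrow.accepts("file:///C:/Policies/a.doc"));    // true
    }
}
```

With the broad line removed, only URLs under C:/Policies/ pass in this model.
Whether Nutch applies the filter file at all still depends on a URL filter
being enabled in the configuration, as mentioned above.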

Sorry I cannot be of help, but I feel somewhat behind you in terms of Nutch
development. I am thinking of trying Nutch 0.8 instead of 0.9, as there is
more documentation for it, although I have read that it is slow, about half
the crawl speed of 0.9. Are you using 0.8?

Regards,

Oli


Vadim B wrote:
> 
> Could you solve the problem? 
> 
> I get a transfer speed of about 800kb/s, which is not fast enough to use in
> a production environment. What about you?
> 
> 
> 
> opoole wrote:
>> 
>> Sorry Vadim,
>> 
>> I did not realise you had sent me the email [Doh!].
>> 
>> 
>> Vadim B wrote:
>>> 
>>> Hi,
>>> 
>>> I am working on the same issue as you. So far I can crawl file:///C:/*
>>> but I am stuck on the SMB part. It looks to me as though this plugin isn't
>>> working properly, so it needs to be fixed for the newer version of Nutch.
>>> 
>>> The error I get differs a bit from yours:
>>> 
>>> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetching
>>> smb://mobidick/test/
>>> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetch of
>>> smb://mobidick/test/ failed with:
>>> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for
>>> url=smb
>>> 
>>> I will dive into plugin-smb and try to narrow down the problem. Maybe
>>> we can work together to get a quick solution.
>>> 
>>> 
>>> 
>>> ---SNIP---
>>> 
>>> # accept hosts in MY.DOMAIN.NAME
>>> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>> +^file:///C:/Policies/ <<-- why did you put this here? It doesn't make sense,
>>> because the +^(file|smb): line above already matches, so this one will be
>>> skipped.
>>> ---SNIP ---
>>> 
>>> ---SNIP ---
>>> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
>>> 'smb://sql1/Sales/DATA/' 
>>> // Did you quote the URL, or is it displayed like this in the logs? I
>>> don't get this error.
>>> ---SNIP ---
>>> 
>>> Try this in the class org.apache.nutch.crawl.Crawl:
>>> 
>>>   public static void main(String args[]) throws Exception {
>>>     // new: register jcifs as a URL protocol handler package
>>>     System.setProperty("java.protocol.handler.pkgs", "jcifs");
>>>     // new: log the value that was actually set
>>>     LOG.info("SMB Info: " + System.getProperty("java.protocol.handler.pkgs"));
>>>     // new: log the permission that covers this property
>>>     LOG.info("SMB Info: " + new java.util.PropertyPermission(
>>>         "java.protocol.handler.pkgs", "read, write").toString());
>>>     if (args.length < 1) {
>>>       System.out.println
>>>         ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
>>>       return;
>>>     }
>>> ---SNIP---
>>> 
>>> Check this out:
>>> http://java.sun.com/developer/onlineTraining/protocolhandlers/
>>> 
>>> 
>>> 
>>> 
>>> 
>>> opoole wrote:
>>>> 
>>>> Hi All, I hope you can help, as I am becoming rather depressed with
>>>> Nutch on Windows.
>>>> 
>>>> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
>>>> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
>>>> 
>>>> I cannot stop Nutch from crawling parent directories. I have looked at
>>>> other threads, and none of the suggestions seem to work.
>>>> 
>>>> I have tried to include protocol-smb [jcifs] but Cygwin keeps prompting
>>>> for Java syntax corrections.
>>>> 
>>>> Below I have listed my configurations along with the command I type in
>>>> cygwin for jcifs:
>>>> 
>>>> CRAWL-URLFILTER
>>>> # The url filter file used by the crawl command.
>>>> 
>>>> # Better for intranet crawling.
>>>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>>>> 
>>>> # Each non-comment, non-blank line contains a regular expression
>>>> # prefixed by '+' or '-'.  The first matching pattern in the file
>>>> # determines whether a URL is included or ignored.  If no pattern
>>>> # matches, the URL is ignored.
>>>> 
>>>> # skip http:, ftp:, & mailto: urls; accept file: and smb: urls
>>>> -^(http|ftp|mailto):
>>>> +^(file|smb):
>>>> 
>>>> # skip image and other suffixes we can't yet parse
>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>>> 
>>>> # skip URLs containing certain characters as probable queries, etc.
>>>> 
>>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>> 
>>>> # accept hosts in MY.DOMAIN.NAME
>>>> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>>> +^file:///C:/Policies/ <<-- why did you put this here? It doesn't make sense,
>>>> because the +^(file|smb): rule above already matches!
>>>> 
>>>> # skip everything else
>>>> -.
>>>> 
>>>> NUTCH-SITE
>>>> 
>>>> <?xml version="1.0"?>
>>>> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
>>>> <!-- Put site-specific property overrides in this file. -->
>>>> 
>>>> <nutch-conf>
>>>> 
>>>> <property>
>>>>  <name>http.agent.name</name>
>>>>  <value>pascall</value>
>>>>  <description></description>
>>>> </property>
>>>> 
>>>> <property>
>>>>   <name>file.content.limit</name>
>>>>   <value>-1</value>
>>>>   <description>The length limit for downloaded content, in bytes.
>>>>   If this value is nonnegative (>=0), content longer than it will be
>>>> truncated;
>>>>   otherwise, no truncation at all.
>>>>   </description>
>>>> </property>
>>>> 
>>>> <property>
>>>>   <name>file.crawl.parent</name>
>>>>   <value>false</value>
>>>>   <description>The crawler is not restricted to the directories that
>>>>     you specified in the Urls file but it is jumping into the parent
>>>>     directories as well. For your own crawlings you can change this
>>>>     behavior (set to false) the way that only directories beneath the
>>>>     directories that you specify get crawled.</description>
>>>> </property>
>>>> 
>>>> <property>
>>>> <name>plugin.includes</name> 
>>>> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
>>>> </property> 
>>>> 
>>>> </nutch-conf>
>>>> 
>>>> CYGWIN
>>>> 
>>>> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
>>>> 
>>>> java -Djava.protocol.handler.pkgs=jcifs
>>>> 
>>>> When I press return the cygwin shell displays a list of java commands
>>>> as though I am using incorrect syntax.
>>>> 
>>>> Dump of Crawl from Cygwin:
>>>> 
>>>> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - crawl started in: crawl
>>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - rootUrlDir = urls.txt
>>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - threads = 10
>>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - depth = 5
>>>> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: starting
>>>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: crawlDb:
>>>> crawl/crawldb
>>>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: urlDir:
>>>> urls.txt
>>>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: Converting
>>>> injected urls to crawl db entries.
>>>> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:16,953 INFO  plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository -    Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:17,875 INFO  crawl.Injector - Injector: Merging
>>>> injected urls into crawl db.
>>>> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:18,375 WARN  util.NativeCodeLoader - Unable to load
>>>> native-hadoop library for your platform... using builtin-java classes
>>>> where applicable
>>>> 2007-05-24 14:04:19,281 INFO  crawl.Injector - Injector: done
>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: Selecting
>>>> best-scoring urls due for fetch.
>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: starting
>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: segment:
>>>> crawl/segments/20070524140420
>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: filtering:
>>>> false
>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: topN:
>>>> 2147483647
>>>> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:20,312 INFO  crawl.Generator - Generator: jobtracker
>>>> is 'local', generating exactly one partition.
>>>> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:20,609 INFO  plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository -    Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:20,796 WARN  crawl.PartitionUrlByHost - Malformed URL:
>>>> 'smb://sql1/Sales/DATA/'
>>>> 2007-05-24 14:04:20,843 INFO  plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository -    Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:21,578 INFO  crawl.Generator - Generator: Partitioning
>>>> selected urls by host, for politeness.
>>>> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:21,859 INFO  plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository -    Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
>>>> 'smb://sql1/Sales/DATA/'
>>>> 2007-05-24 14:04:22,843 INFO  crawl.Generator - Generator: done.
>>>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: starting
>>>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: segment:
>>>> crawl/segments/20070524140420
>>>> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:23,187 INFO  fetcher.Fetcher - Fetcher: threads: 10
>>>> 2007-05-24 14:04:23,203 INFO  plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository -    Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetching
>>>> smb://sql1/Sales/DATA/
>>>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetch of
>>>> smb://sql1/Sales/DATA/ failed with:
>>>> org.apache.nutch.protocol.ProtocolNotFound:
>>>> java.net.MalformedURLException: unknown protocol: smb
>>>> 2007-05-24 14:04:23,500 INFO  fetcher.Fetcher - fetching
>>>> file:///C:/Policies/
>>>> 2007-05-24 14:04:23,718 INFO  crawl.SignatureFactory - Using Signature
>>>> impl: org.apache.nutch.crawl.MD5Signature
>>>> 2007-05-24 14:04:24,671 INFO  plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository -    Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:25,171 INFO  fetcher.Fetcher - Fetcher: done
>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: starting
>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: db:
>>>> crawl/crawldb
>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: segments:
>>>> [crawl/segments/20070524140420]
>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: additions
>>>> allowed: true
>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>>>> normalizing: true
>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>>>> filtering: true
>>>> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:25,203 INFO  crawl.CrawlDb - CrawlDb update: Merging
>>>> segment data into db.
>>>> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:25,468 INFO  plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:25,593 INFO  plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 
>>>> 
>>>> Thank you for reading my post, hope you can help.
>>>> 
>>>> Regards,
>>>> 
>>>> Oli
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/WIN-XP-PRO--Djava.protocol*-file%3A---c%3A-folder--Crawling-Parents-tf3809966.html#a10970245
Sent from the Nutch - User mailing list archive at Nabble.com.

