Hi Vadim,
To be honest I am somewhat behind you: I cannot get the SMB protocol set up.
I am unable to get the -Djava.protocol.handler.pkgs option to do anything; I
am using Cygwin and entering the command from within C:\Sun\Java\jdk160\bin.
As for crawl speed, I'd love to get that far.
Also, I noticed that you were crawling from the root of C:\, whereas I want
to crawl a specific folder, and that is where the parent directory issue
crops up: I cannot get Nutch to stop crawling the parent. One thing I
noticed is that I did not have a URLFILTER entry in my nutch-config.xml, and
that makes a difference: if I try to set it up as in the tutorial, it won't
crawl a thing.
Sorry I cannot be of more help; I feel somewhat behind you in terms of Nutch
development. I am thinking of trying Nutch 0.8 instead of 0.9, as there is
more documentation for it, although I have read that it crawls at about half
the speed of 0.9. Are you using 0.8?
Regards,
Oli
Vadim B wrote:
>
> Could you solve the problem?
>
> I get about 800kb/s transfer speed, which is not fast enough to use in a
> production environment. What about you?
>
>
>
> opoole wrote:
>>
>> Sorry Vadim,
>>
>> I did not realise you had sent me the email [Doh!].
>>
>>
>> Vadim B wrote:
>>>
>>> Hi,
>>>
>>> I am working on the same issue as you. So far I can crawl file:///C:/*
>>> but I am stuck on the SMB part. It looks to me as if this plugin isn't
>>> working properly, so it needs to be fixed for the newer version of Nutch.
>>>
>>> The error I get differs a bit from yours:
>>>
>>> 2007-05-25 18:06:29,573 INFO fetcher.Fetcher - fetching
>>> smb://mobidick/test/
>>> 2007-05-25 18:06:29,573 INFO fetcher.Fetcher - fetch of
>>> smb://mobidick/test/ failed with:
>>> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for
>>> url=smb
>>>
>>> I will dive into plugin-smb and try to narrow down the problem. Maybe
>>> we can work together to get a quick solution.
>>>
>>>
>>>
>>> ---SNIP---
>>>
>>> # accept hosts in MY.DOMAIN.NAME
>>> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>> +^file:///C:/Policies/ <<-- Why did you put this here? It has no effect,
>>> because the +^(file|smb): line above already matches, so this line is
>>> never reached.
>>> ---SNIP ---
>>>
>>> ---SNIP ---
>>> 2007-05-24 14:04:22,000 WARN crawl.PartitionUrlByHost - Malformed URL:
>>> 'smb://sql1/Sales/DATA/'
>>> // Did you quote the URL yourself, or is it displayed like this in the
>>> logs? I don't get this error.
>>> ---SNIP ---
>>>
>>> Try this in the class org.apache.nutch.crawl.Crawl:
>>>
>>> public static void main(String args[]) throws Exception {
>>>   // new: register jcifs as a URL protocol handler package
>>>   System.setProperty("java.protocol.handler.pkgs", "jcifs");
>>>   // new: log the value to confirm it was set
>>>   LOG.info("SMB Info: " + System.getProperty("java.protocol.handler.pkgs"));
>>>   // new: log the permission that covers reading and writing the property
>>>   LOG.info("SMB Info: " + new java.util.PropertyPermission(
>>>       "java.protocol.handler.pkgs", "read,write").toString());
>>>   if (args.length < 1) {
>>>     System.out.println
>>>       ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
>>>     return;
>>>   }
>>> ---SNIP---
>>>
>>> check out this:
>>> http://java.sun.com/developer/onlineTraining/protocolhandlers/
>>>
>>>
>>>
>>>
>>>
>>> opoole wrote:
>>>>
>>>> Hi All, I hope you can help as I am becoming rather depressed with
>>>> Nutch on Windows.
>>>>
>>>> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
>>>> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
>>>>
>>>> I cannot stop Nutch from crawling parent directories. I have looked at
>>>> other threads, but none of the suggestions seem to work.
>>>>
>>>> I have tried to include protocol-smb [jcifs] but Cygwin keeps prompting
>>>> for Java syntax corrections.
>>>>
>>>> Below I have listed my configurations along with the command I type in
>>>> cygwin for jcifs:
>>>>
>>>> CRAWL-URLFILTER
>>>> # The url filter file used by the crawl command.
>>>>
>>>> # Better for intranet crawling.
>>>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>>>>
>>>> # Each non-comment, non-blank line contains a regular expression
>>>> # prefixed by '+' or '-'. The first matching pattern in the file
>>>> # determines whether a URL is included or ignored. If no pattern
>>>> # matches, the URL is ignored.
>>>>
>>>> # skip http:, ftp:, & mailto: urls; accept file: and smb: urls
>>>> -^(http|ftp|mailto):
>>>> +^(file|smb):
>>>>
>>>> # skip image and other suffixes we can't yet parse
>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>>>
>>>> # skip URLs containing certain characters as probable queries, etc.
>>>>
>>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>>
>>>> # accept hosts in MY.DOMAIN.NAME
>>>> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>>> +^file:///C:/Policies/ <<-- Why did you put this here? It doesn't make
>>>> sense, because the +^(file|smb): line already matches!
>>>>
>>>> # skip everything else
>>>> -.
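The comments at the top of this filter file say that the first matching pattern decides, which would explain why the broad +^(file|smb): rule lets the parent of C:/Policies/ through. Below is a minimal Java sketch of that first-match-wins rule. It is an illustration only, not Nutch's own filter code: the class name FirstMatchFilter and its methods are made up for this example, and the rule lists are taken from the file above with the broad rule either kept or removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: applies '+'/'-' regex rules, first match wins. */
final class FirstMatchFilter {
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Each line is '+' (accept) or '-' (reject) followed by a regular expression. */
    FirstMatchFilter(String... lines) {
        for (String line : lines) {
            boolean accept = line.charAt(0) == '+';
            rules.add(new Rule(accept, Pattern.compile(line.substring(1))));
        }
    }

    /** Returns the verdict of the first rule whose pattern is found in the URL. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // no pattern matched: the URL is ignored
    }

    public static void main(String[] args) {
        // Rules as posted: the broad file/smb rule comes before the folder rule.
        FirstMatchFilter posted = new FirstMatchFilter(
                "-^(http|ftp|mailto):", "+^(file|smb):", "+^file:///C:/Policies/", "-.");
        // Assumed alternative: only the folder rule, then reject everything else.
        FirstMatchFilter restricted = new FirstMatchFilter(
                "-^(http|ftp|mailto):", "+^file:///C:/Policies/", "-.");

        System.out.println("posted accepts parent:     " + posted.accepts("file:///C:/"));
        System.out.println("restricted accepts parent: " + restricted.accepts("file:///C:/"));
    }
}
```

With the restricted list, smb: URLs are rejected as well, so an SMB share would need its own specific + line ahead of the final -. rule.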
>>>>
>>>> NUTCH-SITE
>>>>
>>>> <?xml version="1.0"?>
>>>> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
>>>> <!-- Put site-specific property overrides in this file. -->
>>>>
>>>> <nutch-conf>
>>>>
>>>> <property>
>>>> <name>http.agent.name</name>
>>>> <value>pascall</value>
>>>> <description></description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>file.content.limit</name>
>>>> <value>-1</value>
>>>> <description>The length limit for downloaded content, in bytes.
>>>> If this value is nonnegative (>=0), content longer than it will be
>>>> truncated;
>>>> otherwise, no truncation at all.
>>>> </description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>file.crawl.parent</name>
>>>> <value>false</value>
>>>> <description>By default the crawler is not restricted to the
>>>> directories that you specify in the urls file; it also climbs into
>>>> the parent directories. For your own crawls you can change this
>>>> behavior (set to false) so that only directories beneath the ones
>>>> you specify get crawled.</description>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>plugin.includes</name>
>>>> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
>>>> </property>
>>>>
>>>> </nutch-conf>
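The crawl log further down repeats "bad conf file: top-level element not <configuration>", which suggests the configuration loader rejects a file whose root element is <nutch-conf>, as in the file above. If so, the overrides in this file (including file.crawl.parent and plugin.includes) may not be applied at all. The Java sketch below only demonstrates that kind of root-element check with the JDK's own XML parser; the class ConfRootCheck and its methods are hypothetical and are not the loader's real code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

/** Hypothetical helper: reports the root element of a configuration file. */
final class ConfRootCheck {

    /** Parses the XML text and returns the tag name of its top-level element. */
    static String rootElementName(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            return root.getTagName();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** True when the top-level element is the one the log message asks for. */
    static boolean hasConfigurationRoot(String xml) {
        return "configuration".equals(rootElementName(xml));
    }

    public static void main(String[] args) {
        String property = "<property><name>http.agent.name</name>"
                + "<value>pascall</value></property>";
        String asPosted = "<nutch-conf>" + property + "</nutch-conf>";
        String renamed = "<configuration>" + property + "</configuration>";

        System.out.println("as posted: " + hasConfigurationRoot(asPosted));
        System.out.println("renamed:   " + hasConfigurationRoot(renamed));
    }
}
```

Under that reading, renaming the opening and closing <nutch-conf> tags to <configuration> would be the first thing to try; whether that also cures the parent-directory crawling is not something this thread confirms.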
>>>>
>>>> CYGWIN
>>>>
>>>> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
>>>>
>>>> java -Djava.protocol.handler.pkgs=jcifs
>>>>
>>>> When I press return the cygwin shell displays a list of java commands
>>>> as though I am using incorrect syntax.
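The java launcher expects a class name (or -jar and a jar file) after its options, so a -D option on its own makes it print its usage text, which matches what is described here. The Java sketch below is one way to test the option in isolation: it prints the property and tries to build an smb: URL. The class name SmbHandlerCheck and the method tryUrl are made up for this example. Run it as, for instance, java -Djava.protocol.handler.pkgs=jcifs -cp ".;jcifs.jar" SmbHandlerCheck, where the jar file name is an assumption about your setup.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical check: can this JVM build URLs for a given protocol? */
final class SmbHandlerCheck {

    /** Returns null when the URL can be built, otherwise the exception message. */
    static String tryUrl(String spec) {
        try {
            new URL(spec); // fails if no handler is registered for the protocol
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The -D option on the command line ends up in this system property.
        System.out.println("java.protocol.handler.pkgs = "
                + System.getProperty("java.protocol.handler.pkgs"));

        String problem = tryUrl("smb://sql1/Sales/DATA/");
        System.out.println(problem == null
                ? "smb handler found"
                : "smb handler missing: " + problem);
    }
}
```

For the crawl itself the option has to reach the JVM that the nutch script starts. If the bin/nutch script in your release honors a NUTCH_OPTS environment variable for extra JVM options (worth checking in the script), that would be a place to set it.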
>>>>
>>>> Dump of Crawl from Cygwin:
>>>>
>>>> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:16,171 INFO crawl.Crawl - crawl started in: crawl
>>>> 2007-05-24 14:04:16,171 INFO crawl.Crawl - rootUrlDir = urls.txt
>>>> 2007-05-24 14:04:16,171 INFO crawl.Crawl - threads = 10
>>>> 2007-05-24 14:04:16,171 INFO crawl.Crawl - depth = 5
>>>> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:16,281 INFO crawl.Injector - Injector: starting
>>>> 2007-05-24 14:04:16,281 INFO crawl.Injector - Injector: crawlDb:
>>>> crawl/crawldb
>>>> 2007-05-24 14:04:16,296 INFO crawl.Injector - Injector: urlDir:
>>>> urls.txt
>>>> 2007-05-24 14:04:16,296 INFO crawl.Injector - Injector: Converting
>>>> injected urls to crawl db entries.
>>>> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:16,953 INFO plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:17,875 INFO crawl.Injector - Injector: Merging
>>>> injected urls into crawl db.
>>>> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:18,375 WARN util.NativeCodeLoader - Unable to load
>>>> native-hadoop library for your platform... using builtin-java classes
>>>> where applicable
>>>> 2007-05-24 14:04:19,281 INFO crawl.Injector - Injector: done
>>>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: Selecting
>>>> best-scoring urls due for fetch.
>>>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: starting
>>>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: segment:
>>>> crawl/segments/20070524140420
>>>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: filtering:
>>>> false
>>>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: topN:
>>>> 2147483647
>>>> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:20,312 INFO crawl.Generator - Generator: jobtracker
>>>> is 'local', generating exactly one partition.
>>>> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:20,609 INFO plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:20,796 WARN crawl.PartitionUrlByHost - Malformed URL:
>>>> 'smb://sql1/Sales/DATA/'
>>>> 2007-05-24 14:04:20,843 INFO plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:21,578 INFO crawl.Generator - Generator: Partitioning
>>>> selected urls by host, for politeness.
>>>> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:21,859 INFO plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:22,000 WARN crawl.PartitionUrlByHost - Malformed URL:
>>>> 'smb://sql1/Sales/DATA/'
>>>> 2007-05-24 14:04:22,843 INFO crawl.Generator - Generator: done.
>>>> 2007-05-24 14:04:22,843 INFO fetcher.Fetcher - Fetcher: starting
>>>> 2007-05-24 14:04:22,843 INFO fetcher.Fetcher - Fetcher: segment:
>>>> crawl/segments/20070524140420
>>>> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:23,187 INFO fetcher.Fetcher - Fetcher: threads: 10
>>>> 2007-05-24 14:04:23,203 INFO plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:23,390 INFO fetcher.Fetcher - fetching
>>>> smb://sql1/Sales/DATA/
>>>> 2007-05-24 14:04:23,390 INFO fetcher.Fetcher - fetch of
>>>> smb://sql1/Sales/DATA/ failed with:
>>>> org.apache.nutch.protocol.ProtocolNotFound:
>>>> java.net.MalformedURLException: unknown protocol: smb
>>>> 2007-05-24 14:04:23,500 INFO fetcher.Fetcher - fetching
>>>> file:///C:/Policies/
>>>> 2007-05-24 14:04:23,718 INFO crawl.SignatureFactory - Using Signature
>>>> impl: org.apache.nutch.crawl.MD5Signature
>>>> 2007-05-24 14:04:24,671 INFO plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:25,171 INFO fetcher.Fetcher - Fetcher: done
>>>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: starting
>>>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: db:
>>>> crawl/crawldb
>>>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: segments:
>>>> [crawl/segments/20070524140420]
>>>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: additions
>>>> allowed: true
>>>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: URL
>>>> normalizing: true
>>>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: URL
>>>> filtering: true
>>>> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:25,203 INFO crawl.CrawlDb - CrawlDb update: Merging
>>>> segment data into db.
>>>> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:25,468 INFO plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:25,593 INFO plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>>
>>>>
>>>> Thank you for reading my post, hope you can help.
>>>>
>>>> Regards,
>>>>
>>>> Oli
>>>>
>>>
>>>
>>
>>
>
>
--
View this message in context:
http://www.nabble.com/WIN-XP-PRO--Djava.protocol*-file%3A---c%3A-folder--Crawling-Parents-tf3809966.html#a10970245
Sent from the Nutch - User mailing list archive at Nabble.com.
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general