Hi,
I am working on the same issue as you. So far I can crawl file:///C:/* but
I am stuck on the smb part. It looks to me as if this plugin isn't working
properly, so it needs to be fixed for the newer version of Nutch.
The error I get differs a bit from yours:
2007-05-25 18:06:29,573 INFO fetcher.Fetcher - fetching
smb://mobidick/test/
2007-05-25 18:06:29,573 INFO fetcher.Fetcher - fetch of
smb://mobidick/test/ failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb
I will dive into plugin-smb and try to narrow down the problem. Maybe we
can work together to get a quick solution.
---SNIP---
# accept hosts in MY.DOMAIN.NAME
# Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^file:///C:/Policies/ <<-- Why did you put this here? It has no effect,
because the +^(file|smb): line above already matches, so this rule is
never reached.
---SNIP ---
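To illustrate the first-match rule that the comments in the filter file describe, here is a minimal sketch in plain Java. It is not Nutch's own filter code: the class FirstMatchFilterSketch and its methods are made up for this example, and I am assuming the patterns are applied as partial matches (find) rather than full matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Toy first-match URL filter (hypothetical helper, not part of Nutch). */
final class FirstMatchFilterSketch {

    /** One rule line: '+' accepts, '-' rejects. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    FirstMatchFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** Returns the 0-based index of the first rule that matches, or -1 if none does. */
    int firstMatch(String url) {
        for (int i = 0; i < rules.size(); i++) {
            if (rules.get(i).pattern.matcher(url).find()) {
                return i;
            }
        }
        return -1;
    }

    /** A URL is accepted only if the first matching rule is a '+' rule. */
    boolean accepts(String url) {
        int i = firstMatch(url);
        return i >= 0 && rules.get(i).accept;
    }

    public static void main(String[] args) {
        FirstMatchFilterSketch filter = new FirstMatchFilterSketch(Arrays.asList(
                "-^(http|ftp|mailto):",
                "+^(file|smb):",
                "+^file:///C:/Policies/",
                "-."));
        for (String url : Arrays.asList(
                "file:///C:/Policies/a.doc", "file:///C:/", "smb://mobidick/test/")) {
            System.out.println(url + " -> rule " + filter.firstMatch(url)
                    + ", accepted=" + filter.accepts(url));
        }
    }
}
```

With these four rules every file: URL, including the parent file:///C:/, is decided by the +^(file|smb): rule, so the Policies rule is never consulted. If only C:/Policies should pass, removing the broad rule would let the later lines decide, at least in this sketch.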
---SNIP ---
2007-05-24 14:04:22,000 WARN crawl.PartitionUrlByHost - Malformed URL:
'smb://sql1/Sales/DATA/'
// Did you quote the URL yourself, or is it displayed like this in the
logs? I don't get this error.
---SNIP ---
Try this in org.apache.nutch.crawl.Crawl (the lines marked "new" are the
additions):
  public static void main(String args[]) throws Exception {
    // new: tell java.net.URL to look for protocol handlers in the jcifs package
    System.setProperty("java.protocol.handler.pkgs", "jcifs");
    // new: log the value to confirm the property is really set
    LOG.info("SMB Info: "
        + System.getProperty("java.protocol.handler.pkgs"));
    // new: log the permission this property requires under a security manager
    LOG.info("SMB Info: "
        + new java.util.PropertyPermission(
            "java.protocol.handler.pkgs", "read,write").toString());
    if (args.length < 1) {
      System.out.println
        ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
      return;
    }
---SNIP---
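To see whether the JVM itself knows the smb scheme, independent of Nutch, a small stand-alone check like the following may help. It is only a sketch: the class SmbUrlCheckSketch and the method urlError are made up for this example, and it assumes nothing about how the plugin resolves protocols. java.net.URI only parses the string, while java.net.URL additionally needs a protocol handler, which is where "unknown protocol" comes from.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

/** Checks whether java.net.URL accepts a given URL string (hypothetical helper). */
final class SmbUrlCheckSketch {

    /** Returns null if java.net.URL accepts the string, otherwise the exception message. */
    static String urlError(String spec) {
        try {
            new URL(spec); // needs a registered handler for the scheme
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String spec = args.length > 0 ? args[0] : "smb://sql1/Sales/DATA/";

        // URI parsing needs no handler, so it shows whether the string itself is well formed.
        URI uri = URI.create(spec);
        System.out.println("scheme=" + uri.getScheme() + " host=" + uri.getHost());

        // Shows whether -Djava.protocol.handler.pkgs=... actually reached this JVM.
        System.out.println("java.protocol.handler.pkgs="
                + System.getProperty("java.protocol.handler.pkgs"));

        String error = urlError(spec);
        System.out.println(error == null
                ? "java.net.URL accepted it"
                : "java.net.URL rejected it: " + error);
    }
}
```

Running it once without and once with -Djava.protocol.handler.pkgs=jcifs (and the jcifs jar on the classpath) should show whether the property makes a difference on your machine.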
check out this:
http://java.sun.com/developer/onlineTraining/protocolhandlers/
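One more thing I noticed in your log: the repeated FATAL "bad conf file: top-level element not <configuration>" lines, while the nutch-site.xml you posted starts with <nutch-conf>. That suggests the file is being rejected, though I have not verified this on my side. A small stand-alone check that prints the root element of a conf file's content could look like this; it uses only the JDK XML parser, nothing from Nutch or Hadoop, and the class and method names are made up for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Reports the top-level element of an XML conf file (hypothetical helper). */
final class ConfRootCheckSketch {

    /** Parses the XML text and returns the name of its root element. */
    static String rootElementName(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalStateException("could not parse conf XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened stand-in for the posted nutch-site.xml.
        String posted = "<?xml version=\"1.0\"?>"
                + "<nutch-conf><property><name>http.agent.name</name>"
                + "<value>pascall</value></property></nutch-conf>";
        String root = rootElementName(posted);
        System.out.println("root element: " + root);
        if (!"configuration".equals(root)) {
            System.out.println("the log message expects <configuration> here");
        }
    }
}
```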
opoole wrote:
>
> Hi All, I hope you can help as I am becoming rather depressed with Nutch
> on Windows.
>
> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
>
> I cannot stop Nutch from crawling parent directories, I have looked at
> other threads and none seem to work.
>
> I have tried to include protocol-smb [jcifs] but Cygwin keeps prompting
> for Java syntax corrections.
>
> Below I have listed my configurations along with the command I type in
> cygwin for jcifs:
>
> CRAWL-URLFILTER
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip http:, ftp:, & mailto: urls; accept file: and smb: urls
> -^(http|ftp|mailto):
> +^(file|smb):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept hosts in MY.DOMAIN.NAME
> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^file:///C:/Policies/ <<-- Why did you put this here? It has no effect,
> because the +^(file|smb): line above already matches!
>
> # skip everything else
> -.
>
> NUTCH-SITE
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
>
> <nutch-conf>
>
> <property>
> <name>http.agent.name</name>
> <value>pascall</value>
> <description></description>
> </property>
>
> <property>
> <name>file.content.limit</name>
> <value>-1</value>
> <description>The length limit for downloaded content, in bytes.
> If this value is nonnegative (>=0), content longer than it will be
> truncated;
> otherwise, no truncation at all.
> </description>
> </property>
>
> <property>
> <name>file.crawl.parent</name>
> <value>false</value>
> <description>By default the crawler is not restricted to the directories
> that you specify in the urls file; it also jumps into the parent
> directories. For your own crawls you can change this behavior (set to
> false) so that only directories beneath the ones you specify get
> crawled.</description>
> </property>
>
> <property>
> <name>plugin.includes</name>
> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
> </property>
>
> </nutch-conf>
>
> CYGWIN
>
> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
>
> java -Djava.protocol.handler.pkgs=jcifs
>
> When I press return the cygwin shell displays a list of java commands as
> though I am using incorrect syntax.
>
> Dump of Crawl from Cygwin:
>
> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:16,171 INFO crawl.Crawl - crawl started in: crawl
> 2007-05-24 14:04:16,171 INFO crawl.Crawl - rootUrlDir = urls.txt
> 2007-05-24 14:04:16,171 INFO crawl.Crawl - threads = 10
> 2007-05-24 14:04:16,171 INFO crawl.Crawl - depth = 5
> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:16,281 INFO crawl.Injector - Injector: starting
> 2007-05-24 14:04:16,281 INFO crawl.Injector - Injector: crawlDb:
> crawl/crawldb
> 2007-05-24 14:04:16,296 INFO crawl.Injector - Injector: urlDir: urls.txt
> 2007-05-24 14:04:16,296 INFO crawl.Injector - Injector: Converting
> injected urls to crawl db entries.
> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:16,953 INFO plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Pdf Parse
> Plug-in
> (parse-pdf)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Site Query
> Filter
> (query-site)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Log4j
> (lib-log4j)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - URL Query Filter
> (query-url)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - HTML Parse
> Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:17,875 INFO crawl.Injector - Injector: Merging injected
> urls into crawl db.
> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:18,375 WARN util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes
> where applicable
> 2007-05-24 14:04:19,281 INFO crawl.Injector - Injector: done
> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: Selecting
> best-scoring urls due for fetch.
> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: starting
> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: segment:
> crawl/segments/20070524140420
> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: filtering:
> false
> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: topN:
> 2147483647
> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:20,312 INFO crawl.Generator - Generator: jobtracker is
> 'local', generating exactly one partition.
> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:20,609 INFO plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Pdf Parse
> Plug-in
> (parse-pdf)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Site Query
> Filter
> (query-site)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Log4j
> (lib-log4j)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - URL Query Filter
> (query-url)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - HTML Parse
> Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:20,796 WARN crawl.PartitionUrlByHost - Malformed URL:
> 'smb://sql1/Sales/DATA/'
> 2007-05-24 14:04:20,843 INFO plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Pdf Parse
> Plug-in
> (parse-pdf)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Site Query
> Filter
> (query-site)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Log4j
> (lib-log4j)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - URL Query Filter
> (query-url)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - HTML Parse
> Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:21,578 INFO crawl.Generator - Generator: Partitioning
> selected urls by host, for politeness.
> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:21,859 INFO plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Pdf Parse
> Plug-in
> (parse-pdf)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Site Query
> Filter
> (query-site)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Log4j
> (lib-log4j)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - URL Query Filter
> (query-url)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - HTML Parse
> Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:22,000 WARN crawl.PartitionUrlByHost - Malformed URL:
> 'smb://sql1/Sales/DATA/'
> 2007-05-24 14:04:22,843 INFO crawl.Generator - Generator: done.
> 2007-05-24 14:04:22,843 INFO fetcher.Fetcher - Fetcher: starting
> 2007-05-24 14:04:22,843 INFO fetcher.Fetcher - Fetcher: segment:
> crawl/segments/20070524140420
> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:23,187 INFO fetcher.Fetcher - Fetcher: threads: 10
> 2007-05-24 14:04:23,203 INFO plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Pdf Parse
> Plug-in
> (parse-pdf)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Site Query
> Filter
> (query-site)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Log4j
> (lib-log4j)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - URL Query Filter
> (query-url)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - HTML Parse
> Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:23,390 INFO fetcher.Fetcher - fetching
> smb://sql1/Sales/DATA/
> 2007-05-24 14:04:23,390 INFO fetcher.Fetcher - fetch of
> smb://sql1/Sales/DATA/ failed with:
> org.apache.nutch.protocol.ProtocolNotFound:
> java.net.MalformedURLException: unknown protocol: smb
> 2007-05-24 14:04:23,500 INFO fetcher.Fetcher - fetching
> file:///C:/Policies/
> 2007-05-24 14:04:23,718 INFO crawl.SignatureFactory - Using Signature
> impl: org.apache.nutch.crawl.MD5Signature
> 2007-05-24 14:04:24,671 INFO plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Pdf Parse
> Plug-in
> (parse-pdf)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Site Query
> Filter
> (query-site)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Log4j
> (lib-log4j)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - URL Query Filter
> (query-url)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - HTML Parse
> Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:25,171 INFO fetcher.Fetcher - Fetcher: done
> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: starting
> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: db:
> crawl/crawldb
> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: segments:
> [crawl/segments/20070524140420]
> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: additions
> allowed: true
> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: URL
> normalizing: true
> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: URL
> filtering: true
> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:25,203 INFO crawl.CrawlDb - CrawlDb update: Merging
> segment data into db.
> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:25,468 INFO plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:25,593 INFO plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
>
>
> Thank you for reading my post, hope you can help.
>
> Regards,
>
> Oli
>
--
View this message in context:
http://www.nabble.com/WIN-XP-PRO--Djava.protocol*-file%3A---c%3A-folder--Crawling-Parents-tf3809966.html#a10806240
Sent from the Nutch - User mailing list archive at Nabble.com.