Sorry Vadim,
I did not realise you had sent me the email [Doh!].
Vadim B wrote:
>
> Hi,
>
> I am working on the same issue as you. So far I can crawl file:///C:/*,
> but I am stuck on the smb part. It looks to me as though this plugin isn't
> working properly, so it needs to be fixed for the newer version of Nutch.
>
> The error I get differs a bit from yours:
>
> 2007-05-25 18:06:29,573 INFO fetcher.Fetcher - fetching
> smb://mobidick/test/
> 2007-05-25 18:06:29,573 INFO fetcher.Fetcher - fetch of
> smb://mobidick/test/ failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb
>
> I will dive into the protocol-smb plugin and try to narrow down the problem.
> Maybe we can work together to get a quick solution.
>
>
>
> ---SNIP---
>
> # accept hosts in MY.DOMAIN.NAME
> # Default: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^file:///C:/Policies/ <<-- Why did you put this here? It doesn't make sense,
> because the +^(file|smb): line above already matches, so this line is never
> reached.
> ---SNIP ---
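To make the "first matching pattern wins" point concrete, here is a minimal sketch of that rule order in plain Java. It is not Nutch's own filter class; the class name `UrlFilterSketch`, the `find()` matching semantics and the file name `a.doc` are assumptions for illustration only. It uses the rules from the posted crawl-urlfilter (minus the suffix and loop rules).

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    /** Returns the decision of the first rule whose regex is found in the URL. */
    public static boolean accept(List<String> rules, String url) {
        for (String rule : rules) {
            boolean include = rule.charAt(0) == '+';   // '+' accepts, '-' rejects
            String regex = rule.substring(1);
            if (Pattern.compile(regex).matcher(url).find()) {
                return include;                        // first match wins; later rules are never consulted
            }
        }
        return false;                                  // no rule matched: URL is ignored
    }

    public static void main(String[] args) {
        // Rule order as posted: the broad +^(file|smb): line comes first.
        List<String> asPosted = Arrays.asList(
            "-^(http|ftp|mailto):", "+^(file|smb):", "+^file:///C:/Policies/", "-.");
        // Same rules without the broad line, so only the folder rule can accept.
        List<String> restricted = Arrays.asList(
            "-^(http|ftp|mailto):", "+^file:///C:/Policies/", "-.");

        System.out.println(accept(asPosted, "file:///C:/"));                // parent directory
        System.out.println(accept(restricted, "file:///C:/"));              // parent directory
        System.out.println(accept(restricted, "file:///C:/Policies/a.doc"));
    }
}
```

With the posted order the parent directory file:///C:/ is accepted by the broad rule, which would explain why the crawl climbs out of C:/Policies/. Dropping the broad line (and adding a specific smb:// prefix if needed) is one way to keep the crawl inside the folder.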
>
> ---SNIP ---
> 2007-05-24 14:04:22,000 WARN crawl.PartitionUrlByHost - Malformed URL:
> 'smb://sql1/Sales/DATA/'
> // Did you quote the URL, or is it displayed like this in the logs? I don't
> get this error.
> ---SNIP ---
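On the "Malformed URL" warning: as far as I can tell, the quotes are just how the message is printed. The same wording ("unknown protocol: smb") shows up later in the fetcher error, and that is what java.net.URL reports when it has no stream handler for a scheme. A small sketch to check this outside Nutch follows; the class name `UrlSchemeCheck` is made up, and it assumes a stock JVM with no jcifs handler registered.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlSchemeCheck {
    /** True if java.net.URL can find a stream handler for the URL's scheme. */
    public static boolean hasHandler(String spec) {
        try {
            new URL(spec);          // throws if no handler is known for the scheme
            return true;
        } catch (MalformedURLException e) {
            return false;           // e.g. "unknown protocol: smb"
        }
    }

    public static void main(String[] args) {
        System.out.println(hasHandler("file:///C:/Policies/"));
        System.out.println(hasHandler("smb://sql1/Sales/DATA/"));
    }
}
```

If the second call still reports false after the jcifs property has been set, the handler class is probably not visible to the class loader that java.net.URL uses.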
>
> Try this in the class org.apache.nutch.crawl.Crawl:
>
> public static void main(String args[]) throws Exception {
>   // new: let java.net.URL look for protocol handlers under the jcifs package
>   System.setProperty("java.protocol.handler.pkgs", "jcifs");
>   LOG.info("SMB Info: " + System.getProperty("java.protocol.handler.pkgs")); // new
>   LOG.info("SMB Info: "
>       + new java.util.PropertyPermission("java.protocol.handler.pkgs", "read,write")); // new
>   if (args.length < 1) {
>     System.out.println(
>         "Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
>     return;
>   }
> ---SNIP---
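For reference, the java.net.URL documentation describes the lookup that this property drives: for each package prefix in the `|`-separated list, the JVM tries to load a class named `<package>.<protocol>.Handler`. The sketch below only spells out those candidate names; the class `HandlerLookupSketch` is hypothetical and is not how the JVM is implemented internally.

```java
import java.util.ArrayList;
import java.util.List;

public class HandlerLookupSketch {
    /**
     * Lists the class names the JVM would try for a protocol, following the
     * documented pattern <package>.<protocol>.Handler for each entry of
     * java.protocol.handler.pkgs (entries are separated by '|').
     */
    public static List<String> candidates(String handlerPkgs, String protocol) {
        List<String> names = new ArrayList<>();
        if (handlerPkgs == null || handlerPkgs.isEmpty()) {
            return names;                               // property unset: only built-in handlers apply
        }
        for (String pkg : handlerPkgs.split("\\|")) {
            String trimmed = pkg.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed + "." + protocol + ".Handler");
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(candidates("jcifs", "smb"));
    }
}
```

So with the property set to jcifs, the class that has to be loadable is jcifs.smb.Handler. If the jcifs jar only lives inside the plugin folder, that class may not be visible where the lookup happens; I have not verified this against Nutch 0.9, so treat it as a guess.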
>
> Check this out:
> http://java.sun.com/developer/onlineTraining/protocolhandlers/
>
>
>
>
>
> opoole wrote:
>>
>> Hi All, I hope you can help, as I am becoming rather depressed with Nutch
>> on Windows.
>>
>> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
>> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
>>
>> I cannot stop Nutch from crawling parent directories. I have looked at
>> other threads, and none of the suggestions seem to work.
>>
>> I have tried to include protocol-smb [jcifs] but Cygwin keeps prompting
>> for Java syntax corrections.
>>
>> Below I have listed my configurations along with the command I type in
>> cygwin for jcifs:
>>
>> CRAWL-URLFILTER
>> # The url filter file used by the crawl command.
>>
>> # Better for intranet crawling.
>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>>
>> # Each non-comment, non-blank line contains a regular expression
>> # prefixed by '+' or '-'. The first matching pattern in the file
>> # determines whether a URL is included or ignored. If no pattern
>> # matches, the URL is ignored.
>>
>> # skip http:, ftp:, & mailto: urls; accept file: and smb: urls
>> -^(http|ftp|mailto):
>> +^(file|smb):
>>
>> # skip image and other suffixes we can't yet parse
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>
>> # skip URLs containing certain characters as probable queries, etc.
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>
>> # accept hosts in MY.DOMAIN.NAME
>> # Default: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>> +^file:///C:/Policies/ <<-- Why did you put this here? It doesn't make sense,
>> because the +^(file|smb): line above already matches!
>>
>> # skip everything else
>> -.
>>
>> NUTCH-SITE
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <nutch-conf>
>>
>> <property>
>> <name>http.agent.name</name>
>> <value>pascall</value>
>> <description></description>
>> </property>
>>
>> <property>
>> <name>file.content.limit</name>
>> <value>-1</value>
>> <description>The length limit for downloaded content, in bytes.
>> If this value is nonnegative (>=0), content longer than it will be
>> truncated;
>> otherwise, no truncation at all.
>> </description>
>> </property>
>>
>> <property>
>> <name>file.crawl.parent</name>
>> <value>false</value>
>> <description>By default the crawler is not restricted to the directories
>> that you specify in the urls file; it also climbs into the parent
>> directories. For your own crawls you can change this behavior (set to
>> false) so that only directories beneath the ones you specify get
>> crawled.</description>
>> </property>
>>
>> <property>
>> <name>plugin.includes</name>
>> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
>> </property>
>>
>> </nutch-conf>
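One thing worth checking here: the log further down repeats "bad conf file: top-level element not <configuration>", and the file above uses <nutch-conf> as its root element. If this is the file being rejected, its overrides would not take effect. A minimal sketch for printing the root element of a conf file's contents follows; `ConfRootCheck` is a made-up helper, not part of Nutch, and the sample string is a cut-down stand-in for the posted file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ConfRootCheck {
    /** Parses the XML text and returns the name of its top-level element. */
    public static String rootElementName(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            // Wrap parser and I/O errors so callers need no checked-exception handling.
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String posted = "<?xml version=\"1.0\"?><nutch-conf><property/></nutch-conf>";
        // The log message suggests the loader wants "configuration" here.
        System.out.println(rootElementName(posted));
    }
}
```

If the root turns out to be the problem, renaming the opening and closing tags to <configuration> would be the first thing to try.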
>>
>> CYGWIN
>>
>> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
>>
>> java -Djava.protocol.handler.pkgs=jcifs
>>
>> When I press return the cygwin shell displays a list of java commands as
>> though I am using incorrect syntax.
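The usage listing is expected: `java` was given a -D option but no class to run. The option only has an effect on the command line that actually starts the crawl (bin/nutch has a NUTCH_OPTS variable for extra Java options, if I remember correctly; please check your script), or when the property is set in code before the first smb:// URL is created. A minimal sketch of the in-code variant, which appends to any existing value instead of overwriting it; the class `HandlerPkgsSketch` is hypothetical.

```java
public class HandlerPkgsSketch {
    /** Adds pkg to a '|'-separated handler package list unless it is already there. */
    public static String withHandlerPackage(String existing, String pkg) {
        if (existing == null || existing.trim().isEmpty()) {
            return pkg;                       // nothing set yet
        }
        for (String p : existing.split("\\|")) {
            if (p.trim().equals(pkg)) {
                return existing;              // already listed, leave unchanged
            }
        }
        return existing + "|" + pkg;
    }

    public static void main(String[] args) {
        String key = "java.protocol.handler.pkgs";
        System.setProperty(key, withHandlerPackage(System.getProperty(key), "jcifs"));
        // Shows what the JVM will consult when it meets an unknown scheme.
        System.out.println(key + "=" + System.getProperty(key));
    }
}
```

Running it as `java -Djava.protocol.handler.pkgs=jcifs HandlerPkgsSketch` is also a quick way to confirm that a -D option really reaches the program.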
>>
>> Dump of Crawl from Cygwin:
>>
>> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,171 INFO crawl.Crawl - crawl started in: crawl
>> 2007-05-24 14:04:16,171 INFO crawl.Crawl - rootUrlDir = urls.txt
>> 2007-05-24 14:04:16,171 INFO crawl.Crawl - threads = 10
>> 2007-05-24 14:04:16,171 INFO crawl.Crawl - depth = 5
>> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,281 INFO crawl.Injector - Injector: starting
>> 2007-05-24 14:04:16,281 INFO crawl.Injector - Injector: crawlDb:
>> crawl/crawldb
>> 2007-05-24 14:04:16,296 INFO crawl.Injector - Injector: urlDir: urls.txt
>> 2007-05-24 14:04:16,296 INFO crawl.Injector - Injector: Converting
>> injected urls to crawl db entries.
>> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,953 INFO plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:17,875 INFO crawl.Injector - Injector: Merging injected
>> urls into crawl db.
>> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:18,375 WARN util.NativeCodeLoader - Unable to load
>> native-hadoop library for your platform... using builtin-java classes
>> where applicable
>> 2007-05-24 14:04:19,281 INFO crawl.Injector - Injector: done
>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: Selecting
>> best-scoring urls due for fetch.
>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: starting
>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: segment:
>> crawl/segments/20070524140420
>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: filtering:
>> false
>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: topN:
>> 2147483647
>> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:20,312 INFO crawl.Generator - Generator: jobtracker is
>> 'local', generating exactly one partition.
>> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:20,609 INFO plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:20,796 WARN crawl.PartitionUrlByHost - Malformed URL:
>> 'smb://sql1/Sales/DATA/'
>> 2007-05-24 14:04:20,843 INFO plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:21,578 INFO crawl.Generator - Generator: Partitioning
>> selected urls by host, for politeness.
>> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:21,859 INFO plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:22,000 WARN crawl.PartitionUrlByHost - Malformed URL:
>> 'smb://sql1/Sales/DATA/'
>> 2007-05-24 14:04:22,843 INFO crawl.Generator - Generator: done.
>> 2007-05-24 14:04:22,843 INFO fetcher.Fetcher - Fetcher: starting
>> 2007-05-24 14:04:22,843 INFO fetcher.Fetcher - Fetcher: segment:
>> crawl/segments/20070524140420
>> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:23,187 INFO fetcher.Fetcher - Fetcher: threads: 10
>> 2007-05-24 14:04:23,203 INFO plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:23,390 INFO fetcher.Fetcher - fetching
>> smb://sql1/Sales/DATA/
>> 2007-05-24 14:04:23,390 INFO fetcher.Fetcher - fetch of
>> smb://sql1/Sales/DATA/ failed with:
>> org.apache.nutch.protocol.ProtocolNotFound:
>> java.net.MalformedURLException: unknown protocol: smb
>> 2007-05-24 14:04:23,500 INFO fetcher.Fetcher - fetching
>> file:///C:/Policies/
>> 2007-05-24 14:04:23,718 INFO crawl.SignatureFactory - Using Signature
>> impl: org.apache.nutch.crawl.MD5Signature
>> 2007-05-24 14:04:24,671 INFO plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:25,171 INFO fetcher.Fetcher - Fetcher: done
>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: starting
>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: db:
>> crawl/crawldb
>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: segments:
>> [crawl/segments/20070524140420]
>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: additions
>> allowed: true
>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: URL
>> normalizing: true
>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: URL
>> filtering: true
>> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:25,203 INFO crawl.CrawlDb - CrawlDb update: Merging
>> segment data into db.
>> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:25,468 INFO plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:25,593 INFO plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>>
>>
>> Thank you for reading my post, hope you can help.
>>
>> Regards,
>>
>> Oli
>>
>
>
--
View this message in context:
http://www.nabble.com/WIN-XP-PRO--Djava.protocol*-file%3A---c%3A-folder--Crawling-Parents-tf3809966.html#a10852315
Sent from the Nutch - User mailing list archive at Nabble.com.