OK, try this.
As you can see, the two filter files have the same entries. I don't know exactly
why there have to be two where one would be enough, but this keeps me from
crawling the parent dir as well.
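For what it's worth, the comment header in crawl-urlfilter.txt says the first matching pattern in the file decides whether a URL is included, and a URL that matches nothing is ignored. Here is a rough sketch of that logic in plain Java. It is not Nutch's own filter code: the class name FirstMatchFilter and its methods are made up, and I am assuming the regex only has to be found somewhere in the URL. It shows why a broad +^(file|smb): line lets the parent dir through, while a specific +^file:///C:/Policies/ line followed by -. does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a first-match-wins URL filter; not Nutch's own class. */
public class FirstMatchFilter {
    /** One rule line: '+' accepts, '-' rejects, the rest of the line is a regex. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    public FirstMatchFilter(String... lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank lines and comments carry no rule
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    /** The first rule whose regex is found in the URL decides; no match means ignore. */
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Broad rule first: the specific Policies line is never reached.
        FirstMatchFilter broad = new FirstMatchFilter(
            "-^(http|ftp|mailto):", "+^(file|smb):", "+^file:///C:/Policies/", "-.");
        // Only the specific rule: everything outside C:/Policies/ falls through to "-."
        FirstMatchFilter narrow = new FirstMatchFilter(
            "-^(http|ftp|mailto):", "+^file:///C:/Policies/", "-.");

        System.out.println("broad accepts parent:  " + broad.accepts("file:///C:/"));
        System.out.println("narrow accepts parent: " + narrow.accepts("file:///C:/"));
        System.out.println("narrow accepts child:  " + narrow.accepts("file:///C:/Policies/a.doc"));
    }
}
```

So if the goal is to stay inside one folder, the accept line probably has to name that folder, and there should be no broader accept line above it.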
Check the nutch-site.xml too: if I put .* there it doesn't work in my case, so
I have to list the plugins I really need.
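A small sketch of how I understand the plugin list, assuming the .* goes into plugin.includes and assuming Nutch compares the whole plugin id against that expression (I have not verified this against the Nutch source). The class PluginIncludesCheck and the method isIncluded are made up for illustration; the expression is the plugin.includes value from the nutch-site.xml quoted further down in this thread.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; only illustrates whole-id matching against plugin.includes. */
public class PluginIncludesCheck {
    /** True when the whole plugin id matches the plugin.includes expression. */
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // plugin.includes value as posted in the quoted nutch-site.xml
        String includes = "protocol-file|protocol-smb|scoring-opic"
            + "|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)"
            + "|index-basic|query-(basic|site|url)";

        // urlfilter-regex is the id of the stock regex URL filter plugin
        String[] ids = {"protocol-smb", "parse-pdf", "urlfilter-regex"};
        for (String id : ids) {
            System.out.println(id + " included: " + isIncluded(includes, id));
        }
    }
}
```

If the matching really works like this, a list without any urlfilter-... entry would leave the URL filter plugins out, which may be worth checking when the filter files seem to have no effect.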
Also check out my new SMB protocol.
About the -Djava stuff:
Copy jcifs to
C:\Program Files\Java_jdk1.6.0_01\jre\lib\ext (in my case).
Add the following to the main method of Crawl.java:
/* Perform complete crawling and indexing given a set of root urls. */
public static void main(String args[]) throws Exception {
  // new: register jcifs as a URL protocol handler package
  System.setProperty("java.protocol.handler.pkgs", "jcifs");
  // new: log the value that was just set
  LOG.info("SMB Info: " + System.getProperty("java.protocol.handler.pkgs"));
  // new: log the corresponding property permission
  LOG.info("SMB Info: " + new java.util.PropertyPermission(
      "java.protocol.handler.pkgs", "read,write").toString());
  if (args.length < 1) {
  ...and so on....
Then you don't need to set the -Djava... properties before starting the app.
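One caveat: System.setProperty overwrites whatever was already in java.protocol.handler.pkgs. Below is a minimal standalone sketch that appends to the '|'-separated list instead and then checks whether the JVM can resolve an smb:// URL. The class SmbHandlerSetup and both helper methods are made up for this example. I am assuming the jcifs jar is on the classpath (or in lib\ext as above); without it the check simply reports that no handler was found.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper class; not part of Nutch or jcifs. */
public class SmbHandlerSetup {
    static final String HANDLER_PKGS = "java.protocol.handler.pkgs";

    /** Appends pkg to a '|'-separated handler package list unless it is already there. */
    static String appendHandlerPackage(String existing, String pkg) {
        if (existing == null || existing.trim().isEmpty()) {
            return pkg;
        }
        for (String entry : existing.split("\\|")) {
            if (entry.trim().equals(pkg)) {
                return existing; // already registered, keep the list unchanged
            }
        }
        return existing + "|" + pkg;
    }

    /** True when the JVM can find a protocol handler for the URL's scheme. */
    @SuppressWarnings("deprecation")
    static boolean hasHandler(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false; // e.g. "unknown protocol: smb"
        }
    }

    public static void main(String[] args) {
        // Must run before the first smb:// URL is created.
        System.setProperty(HANDLER_PKGS,
            appendHandlerPackage(System.getProperty(HANDLER_PKGS), "jcifs"));

        System.out.println(HANDLER_PKGS + " = " + System.getProperty(HANDLER_PKGS));
        System.out.println("smb handler found: " + hasHandler("smb://mobidick/test/"));
    }
}
```

If the last line prints false, the JVM still cannot see jcifs, and the fetcher would most likely fail with the same "unknown protocol: smb" message as in the log.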
good luck
http://www.nabble.com/file/p11047384/protocol-smb.zip protocol-smb.zip
http://www.nabble.com/file/p11047384/regex-urlfilter.txt regex-urlfilter.txt
http://www.nabble.com/file/p11047384/crawl-urlfilter.txt crawl-urlfilter.txt
http://www.nabble.com/file/p11047384/nutch-site.xml nutch-site.xml
opoole wrote:
>
> Hi Vadim,
>
> To be honest I am somewhat behind you, as my problem is that I cannot get
> the SMB protocol set up. I am unable to get the -Djava bit to do anything;
> I am using Cygwin and entering the command from within sun\java etc.
>
> As for crawl speed, I'd love to get that far.
>
> Also I noticed that you were crawling from the root of C:\ whereas I want
> to crawl a specific folder, and the parent directory issue crops up: I
> cannot get it to stop crawling the parent. One thing I had noticed is
> that I did not have a URLFILTER entry in my nutch-config.xml, and that
> makes a difference, in that if I try to set it up as in the tutorial it
> won't crawl a thing??!!
>
> Sorry I cannot be of help, but I feel somewhat behind you in terms of Nutch
> dev. I am thinking of trying Nutch ver 8 instead of 9, as there is more
> documentation on it, although I have read that it is slow: half the speed
> of ver 9 in terms of crawl speed. Are you using ver 8?
>
> Regards,
>
> Oli
>
>
> Vadim B wrote:
>>
>> Could you solve the problem?
>>
>> I get about 800kb/s transfer speed, which is not fast enough to use in a
>> production environment. What about you?
>>
>>
>>
>> opoole wrote:
>>>
>>> Sorry Vadim,
>>>
>>> I did not realise you had sent me the email [Doh!].
>>>
>>>
>>> Vadim B wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am working on the same issue as you. So far I can crawl
>>>> file:///C:/* but I am stuck on the smb part. It looks to me as if this
>>>> plugin isn't working properly, so it needs to be fixed for the newer
>>>> version of Nutch.
>>>>
>>>> The error I get differs a bit from yours. It is:
>>>>
>>>> 2007-05-25 18:06:29,573 INFO fetcher.Fetcher - fetching
>>>> smb://mobidick/test/
>>>> 2007-05-25 18:06:29,573 INFO fetcher.Fetcher - fetch of
>>>> smb://mobidick/test/ failed with:
>>>> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for
>>>> url=smb
>>>>
>>>> I will dive into the plugin-smb and try to narrow down the problem.
>>>> Maybe we can work together to get a quick solution.
>>>>
>>>>
>>>>
>>>> ---SNIP---
>>>>
>>>> # accept hosts in MY.DOMAIN.NAME
>>>> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>>> +^file:///C:/Policies/ <<-- why did you put this here? It doesn't make
>>>> sense, because the +^(file|smb): line above already matches, so this
>>>> line is never reached
>>>> ---SNIP ---
>>>>
>>>> ---SNIP ---
>>>> 2007-05-24 14:04:22,000 WARN crawl.PartitionUrlByHost - Malformed URL:
>>>> 'smb://sql1/Sales/DATA/'
>>>> // did you quote the URL, or is it displayed in the logs like this? I
>>>> don't get this error
>>>> ---SNIP ---
>>>>
>>>> Try this in the class org.apache.nutch.crawl.Crawl:
>>>>
>>>> public static void main(String args[]) throws Exception {
>>>>   System.setProperty("java.protocol.handler.pkgs", "jcifs"); // new
>>>>   LOG.info("SMB Info: " +
>>>>       System.getProperty("java.protocol.handler.pkgs")); // new
>>>>   LOG.info("SMB Info: " + new java.util.PropertyPermission(
>>>>       "java.protocol.handler.pkgs", "read,write").toString()); // new
>>>>   if (args.length < 1) {
>>>>     System.out.println
>>>>       ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
>>>>     return;
>>>>   }
>>>> ---SNIP---
>>>>
>>>> check out this:
>>>> http://java.sun.com/developer/onlineTraining/protocolhandlers/
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> opoole wrote:
>>>>>
>>>>> Hi All, I hope you can help as I am becoming rather depressed with
>>>>> Nutch on Windows.
>>>>>
>>>>> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
>>>>> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
>>>>>
>>>>> I cannot stop Nutch from crawling parent directories. I have looked at
>>>>> other threads and none of the suggestions seem to work.
>>>>>
>>>>> I have tried to include protocol-smb [jcifs] but Cygwin keeps
>>>>> prompting for Java syntax corrections.
>>>>>
>>>>> Below I have listed my configurations along with the command I type in
>>>>> cygwin for jcifs:
>>>>>
>>>>> CRAWL-URLFILTER
>>>>> # The url filter file used by the crawl command.
>>>>>
>>>>> # Better for intranet crawling.
>>>>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>>>>>
>>>>> # Each non-comment, non-blank line contains a regular expression
>>>>> # prefixed by '+' or '-'. The first matching pattern in the file
>>>>> # determines whether a URL is included or ignored. If no pattern
>>>>> # matches, the URL is ignored.
>>>>>
>>>>> # skip http:, ftp:, & mailto: urls; accept file: and smb: urls
>>>>> -^(http|ftp|mailto):
>>>>> +^(file|smb):
>>>>>
>>>>> # skip image and other suffixes we can't yet parse
>>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>>>>
>>>>> # skip URLs containing certain characters as probable queries, etc.
>>>>>
>>>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>>>
>>>>> # accept hosts in MY.DOMAIN.NAME
>>>>> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>>>> +^file:///C:/Policies/ <<-- why did you put this here? It doesn't make
>>>>> sense, because the +^(file|smb): line already matches!
>>>>>
>>>>> # skip everything else
>>>>> -.
>>>>>
>>>>> NUTCH-SITE
>>>>>
>>>>> <?xml version="1.0"?>
>>>>> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
>>>>> <!-- Put site-specific property overrides in this file. -->
>>>>>
>>>>> <nutch-conf>
>>>>>
>>>>> <property>
>>>>> <name>http.agent.name</name>
>>>>> <value>pascall</value>
>>>>> <description></description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>> <name>file.content.limit</name>
>>>>> <value>-1</value>
>>>>> <description>The length limit for downloaded content, in bytes.
>>>>> If this value is nonnegative (>=0), content longer than it will be
>>>>> truncated;
>>>>> otherwise, no truncation at all.
>>>>> </description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>> <name>file.crawl.parent</name>
>>>>> <value>false</value>
>>>>> <description>The crawler is not restricted to the directories that
>>>>> you specified in the urls file; it also jumps into the parent
>>>>> directories. For your own crawls you can change this behavior (set
>>>>> to false) so that only directories beneath the ones you specify get
>>>>> crawled.</description>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>> <name>plugin.includes</name>
>>>>> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
>>>>> </property>
>>>>>
>>>>> </nutch-conf>
>>>>>
>>>>> CYGWIN
>>>>>
>>>>> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
>>>>>
>>>>> java -Djava.protocol.handler.pkgs=jcifs
>>>>>
>>>>> When I press return the cygwin shell displays a list of java commands
>>>>> as though I am using incorrect syntax.
>>>>>
>>>>> Dump of Crawl from Cygwin:
>>>>>
>>>>> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:16,171 INFO crawl.Crawl - crawl started in: crawl
>>>>> 2007-05-24 14:04:16,171 INFO crawl.Crawl - rootUrlDir = urls.txt
>>>>> 2007-05-24 14:04:16,171 INFO crawl.Crawl - threads = 10
>>>>> 2007-05-24 14:04:16,171 INFO crawl.Crawl - depth = 5
>>>>> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:16,281 INFO crawl.Injector - Injector: starting
>>>>> 2007-05-24 14:04:16,281 INFO crawl.Injector - Injector: crawlDb:
>>>>> crawl/crawldb
>>>>> 2007-05-24 14:04:16,296 INFO crawl.Injector - Injector: urlDir:
>>>>> urls.txt
>>>>> 2007-05-24 14:04:16,296 INFO crawl.Injector - Injector: Converting
>>>>> injected urls to crawl db entries.
>>>>> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:16,953 INFO plugin.PluginRepository - Plugins:
>>>>> looking in: C:\nutch-0.9\plugins
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Plugin
>>>>> Auto-activation mode: [true]
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Registered
>>>>> Plugins:
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - the nutch
>>>>> core extension points (nutch-extensionpoints)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSPowerPoint
>>>>> Parse Plug-in (parse-mspowerpoint)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Basic Query
>>>>> Filter (query-basic)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Basic
>>>>> Indexing Filter (index-basic)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Html Parse
>>>>> Plug-in (parse-html)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Pdf Parse
>>>>> Plug-in (parse-pdf)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Site Query
>>>>> Filter (query-site)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Jakarta POI -
>>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Text Parse
>>>>> Plug-in (parse-text)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSWord Parse
>>>>> Plug-in (parse-msword)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - SMB Protocol
>>>>> Plug-in (protocol-smb)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSExcel Parse
>>>>> Plug-in (parse-msexcel)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - OPIC Scoring
>>>>> Plug-in (scoring-opic)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - CyberNeko
>>>>> HTML Parser (lib-nekohtml)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Log4j
>>>>> (lib-log4j)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - File Protocol
>>>>> Plug-in (protocol-file)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - URL Query
>>>>> Filter (query-url)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Parse MS
>>>>> Documents Framework (lib-parsems)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Registered
>>>>> Extension-Points:
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch
>>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch URL
>>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch
>>>>> Protocol (org.apache.nutch.protocol.Protocol)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch
>>>>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch URL
>>>>> Filter (org.apache.nutch.net.URLFilter)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch
>>>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Online
>>>>> Search Results Clustering Plugin
>>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - HTML Parse
>>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Content
>>>>> Parser (org.apache.nutch.parse.Parser)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Scoring
>>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Query
>>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>>> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Ontology
>>>>> Model Loader (org.apache.nutch.ontology.Ontology)
>>>>> 2007-05-24 14:04:17,875 INFO crawl.Injector - Injector: Merging
>>>>> injected urls into crawl db.
>>>>> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:18,375 WARN util.NativeCodeLoader - Unable to load
>>>>> native-hadoop library for your platform... using builtin-java classes
>>>>> where applicable
>>>>> 2007-05-24 14:04:19,281 INFO crawl.Injector - Injector: done
>>>>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: Selecting
>>>>> best-scoring urls due for fetch.
>>>>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: starting
>>>>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: segment:
>>>>> crawl/segments/20070524140420
>>>>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: filtering:
>>>>> false
>>>>> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: topN:
>>>>> 2147483647
>>>>> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:20,312 INFO crawl.Generator - Generator: jobtracker
>>>>> is 'local', generating exactly one partition.
>>>>> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:20,609 INFO plugin.PluginRepository - Plugins:
>>>>> looking in: C:\nutch-0.9\plugins
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Plugin
>>>>> Auto-activation mode: [true]
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Registered
>>>>> Plugins:
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - the nutch
>>>>> core extension points (nutch-extensionpoints)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSPowerPoint
>>>>> Parse Plug-in (parse-mspowerpoint)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Basic Query
>>>>> Filter (query-basic)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Basic
>>>>> Indexing Filter (index-basic)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Html Parse
>>>>> Plug-in (parse-html)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Pdf Parse
>>>>> Plug-in (parse-pdf)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Site Query
>>>>> Filter (query-site)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Jakarta POI -
>>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Text Parse
>>>>> Plug-in (parse-text)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSWord Parse
>>>>> Plug-in (parse-msword)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - SMB Protocol
>>>>> Plug-in (protocol-smb)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSExcel Parse
>>>>> Plug-in (parse-msexcel)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - OPIC Scoring
>>>>> Plug-in (scoring-opic)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - CyberNeko
>>>>> HTML Parser (lib-nekohtml)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Log4j
>>>>> (lib-log4j)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - File Protocol
>>>>> Plug-in (protocol-file)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - URL Query
>>>>> Filter (query-url)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Parse MS
>>>>> Documents Framework (lib-parsems)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Registered
>>>>> Extension-Points:
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch
>>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch URL
>>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch
>>>>> Protocol (org.apache.nutch.protocol.Protocol)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch
>>>>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch URL
>>>>> Filter (org.apache.nutch.net.URLFilter)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch
>>>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Online
>>>>> Search Results Clustering Plugin
>>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - HTML Parse
>>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Content
>>>>> Parser (org.apache.nutch.parse.Parser)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Scoring
>>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Query
>>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>>> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Ontology
>>>>> Model Loader (org.apache.nutch.ontology.Ontology)
>>>>> 2007-05-24 14:04:20,796 WARN crawl.PartitionUrlByHost - Malformed
>>>>> URL: 'smb://sql1/Sales/DATA/'
>>>>> 2007-05-24 14:04:20,843 INFO plugin.PluginRepository - Plugins:
>>>>> looking in: C:\nutch-0.9\plugins
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Plugin
>>>>> Auto-activation mode: [true]
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Registered
>>>>> Plugins:
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - the nutch
>>>>> core extension points (nutch-extensionpoints)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSPowerPoint
>>>>> Parse Plug-in (parse-mspowerpoint)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Basic Query
>>>>> Filter (query-basic)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Basic
>>>>> Indexing Filter (index-basic)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Html Parse
>>>>> Plug-in (parse-html)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Pdf Parse
>>>>> Plug-in (parse-pdf)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Site Query
>>>>> Filter (query-site)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Jakarta POI -
>>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Text Parse
>>>>> Plug-in (parse-text)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSWord Parse
>>>>> Plug-in (parse-msword)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - SMB Protocol
>>>>> Plug-in (protocol-smb)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSExcel Parse
>>>>> Plug-in (parse-msexcel)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - OPIC Scoring
>>>>> Plug-in (scoring-opic)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - CyberNeko
>>>>> HTML Parser (lib-nekohtml)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Log4j
>>>>> (lib-log4j)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - File Protocol
>>>>> Plug-in (protocol-file)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - URL Query
>>>>> Filter (query-url)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Parse MS
>>>>> Documents Framework (lib-parsems)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Registered
>>>>> Extension-Points:
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch
>>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch URL
>>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch
>>>>> Protocol (org.apache.nutch.protocol.Protocol)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch
>>>>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch URL
>>>>> Filter (org.apache.nutch.net.URLFilter)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch
>>>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Online
>>>>> Search Results Clustering Plugin
>>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - HTML Parse
>>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Content
>>>>> Parser (org.apache.nutch.parse.Parser)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Scoring
>>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Query
>>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>>> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Ontology
>>>>> Model Loader (org.apache.nutch.ontology.Ontology)
>>>>> 2007-05-24 14:04:21,578 INFO crawl.Generator - Generator:
>>>>> Partitioning selected urls by host, for politeness.
>>>>> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:21,859 INFO plugin.PluginRepository - Plugins:
>>>>> looking in: C:\nutch-0.9\plugins
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Plugin
>>>>> Auto-activation mode: [true]
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Registered
>>>>> Plugins:
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - the nutch
>>>>> core extension points (nutch-extensionpoints)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSPowerPoint
>>>>> Parse Plug-in (parse-mspowerpoint)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Basic Query
>>>>> Filter (query-basic)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Basic
>>>>> Indexing Filter (index-basic)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Html Parse
>>>>> Plug-in (parse-html)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Pdf Parse
>>>>> Plug-in (parse-pdf)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Site Query
>>>>> Filter (query-site)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Jakarta POI -
>>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Text Parse
>>>>> Plug-in (parse-text)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSWord Parse
>>>>> Plug-in (parse-msword)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - SMB Protocol
>>>>> Plug-in (protocol-smb)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSExcel Parse
>>>>> Plug-in (parse-msexcel)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - OPIC Scoring
>>>>> Plug-in (scoring-opic)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - CyberNeko
>>>>> HTML Parser (lib-nekohtml)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Log4j
>>>>> (lib-log4j)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - File Protocol
>>>>> Plug-in (protocol-file)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - URL Query
>>>>> Filter (query-url)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Parse MS
>>>>> Documents Framework (lib-parsems)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Registered
>>>>> Extension-Points:
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch
>>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch URL
>>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch
>>>>> Protocol (org.apache.nutch.protocol.Protocol)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch
>>>>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch URL
>>>>> Filter (org.apache.nutch.net.URLFilter)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch
>>>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Online
>>>>> Search Results Clustering Plugin
>>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - HTML Parse
>>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Content
>>>>> Parser (org.apache.nutch.parse.Parser)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Scoring
>>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Query
>>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>>> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Ontology
>>>>> Model Loader (org.apache.nutch.ontology.Ontology)
>>>>> 2007-05-24 14:04:22,000 WARN crawl.PartitionUrlByHost - Malformed
>>>>> URL: 'smb://sql1/Sales/DATA/'
>>>>> 2007-05-24 14:04:22,843 INFO crawl.Generator - Generator: done.
>>>>> 2007-05-24 14:04:22,843 INFO fetcher.Fetcher - Fetcher: starting
>>>>> 2007-05-24 14:04:22,843 INFO fetcher.Fetcher - Fetcher: segment:
>>>>> crawl/segments/20070524140420
>>>>> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:23,187 INFO fetcher.Fetcher - Fetcher: threads: 10
>>>>> 2007-05-24 14:04:23,203 INFO plugin.PluginRepository - Plugins:
>>>>> looking in: C:\nutch-0.9\plugins
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Plugin
>>>>> Auto-activation mode: [true]
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Registered
>>>>> Plugins:
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - the nutch
>>>>> core extension points (nutch-extensionpoints)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSPowerPoint
>>>>> Parse Plug-in (parse-mspowerpoint)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Basic Query
>>>>> Filter (query-basic)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Basic
>>>>> Indexing Filter (index-basic)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Html Parse
>>>>> Plug-in (parse-html)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Pdf Parse
>>>>> Plug-in (parse-pdf)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Site Query
>>>>> Filter (query-site)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Jakarta POI -
>>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Text Parse
>>>>> Plug-in (parse-text)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSWord Parse
>>>>> Plug-in (parse-msword)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - SMB Protocol
>>>>> Plug-in (protocol-smb)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSExcel Parse
>>>>> Plug-in (parse-msexcel)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - OPIC Scoring
>>>>> Plug-in (scoring-opic)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - CyberNeko
>>>>> HTML Parser (lib-nekohtml)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Log4j
>>>>> (lib-log4j)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - File Protocol
>>>>> Plug-in (protocol-file)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - URL Query
>>>>> Filter (query-url)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Parse MS
>>>>> Documents Framework (lib-parsems)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Registered
>>>>> Extension-Points:
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch
>>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch URL
>>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch
>>>>> Protocol (org.apache.nutch.protocol.Protocol)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch
>>>>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch URL
>>>>> Filter (org.apache.nutch.net.URLFilter)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch
>>>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Online
>>>>> Search Results Clustering Plugin
>>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - HTML Parse
>>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Content
>>>>> Parser (org.apache.nutch.parse.Parser)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Scoring
>>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Query
>>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>>> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Ontology
>>>>> Model Loader (org.apache.nutch.ontology.Ontology)
>>>>> 2007-05-24 14:04:23,390 INFO fetcher.Fetcher - fetching
>>>>> smb://sql1/Sales/DATA/
>>>>> 2007-05-24 14:04:23,390 INFO fetcher.Fetcher - fetch of
>>>>> smb://sql1/Sales/DATA/ failed with:
>>>>> org.apache.nutch.protocol.ProtocolNotFound:
>>>>> java.net.MalformedURLException: unknown protocol: smb
>>>>> 2007-05-24 14:04:23,500 INFO fetcher.Fetcher - fetching
>>>>> file:///C:/Policies/
>>>>> 2007-05-24 14:04:23,718 INFO crawl.SignatureFactory - Using Signature
>>>>> impl: org.apache.nutch.crawl.MD5Signature
>>>>> 2007-05-24 14:04:24,671 INFO plugin.PluginRepository - Plugins:
>>>>> looking in: C:\nutch-0.9\plugins
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Plugin
>>>>> Auto-activation mode: [true]
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Registered
>>>>> Plugins:
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - the nutch
>>>>> core extension points (nutch-extensionpoints)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - MSPowerPoint
>>>>> Parse Plug-in (parse-mspowerpoint)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Basic Query
>>>>> Filter (query-basic)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Basic
>>>>> Indexing Filter (index-basic)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Html Parse
>>>>> Plug-in (parse-html)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Pdf Parse
>>>>> Plug-in (parse-pdf)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Site Query
>>>>> Filter (query-site)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Jakarta POI -
>>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Text Parse
>>>>> Plug-in (parse-text)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - MSWord Parse
>>>>> Plug-in (parse-msword)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - SMB Protocol
>>>>> Plug-in (protocol-smb)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - MSExcel Parse
>>>>> Plug-in (parse-msexcel)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - OPIC Scoring
>>>>> Plug-in (scoring-opic)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - CyberNeko
>>>>> HTML Parser (lib-nekohtml)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Log4j
>>>>> (lib-log4j)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - File Protocol
>>>>> Plug-in (protocol-file)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - URL Query
>>>>> Filter (query-url)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Parse MS
>>>>> Documents Framework (lib-parsems)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Registered
>>>>> Extension-Points:
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch
>>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch URL
>>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch
>>>>> Protocol (org.apache.nutch.protocol.Protocol)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch
>>>>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch URL
>>>>> Filter (org.apache.nutch.net.URLFilter)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch
>>>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Online
>>>>> Search Results Clustering Plugin
>>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - HTML Parse
>>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Content
>>>>> Parser (org.apache.nutch.parse.Parser)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Scoring
>>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Query
>>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>>> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Ontology
>>>>> Model Loader (org.apache.nutch.ontology.Ontology)
>>>>> 2007-05-24 14:04:25,171 INFO fetcher.Fetcher - Fetcher: done
>>>>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: starting
>>>>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: db:
>>>>> crawl/crawldb
>>>>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update:
>>>>> segments: [crawl/segments/20070524140420]
>>>>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update:
>>>>> additions allowed: true
>>>>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: URL
>>>>> normalizing: true
>>>>> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: URL
>>>>> filtering: true
>>>>> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:25,203 INFO crawl.CrawlDb - CrawlDb update: Merging
>>>>> segment data into db.
>>>>> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:25,468 INFO plugin.PluginRepository - Plugins:
>>>>> looking in: C:\nutch-0.9\plugins
>>>>> 2007-05-24 14:04:25,593 INFO plugin.PluginRepository - Plugin
>>>>> Auto-activation mode: [true]
>>>>>
>>>>>
>>>>> Thank you for reading my post, hope you can help.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Oli
>>>>>
--
View this message in context:
http://www.nabble.com/WIN-XP-PRO--Djava.protocol*-file%3A---c%3A-folder--Crawling-Parents-tf3809966.html#a11047384
Sent from the Nutch - User mailing list archive at Nabble.com.
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general