Hi all,
I tried to crawl my local filesystem and got the error shown below.
I am using Windows NT and nutch-0.8.1.
I have modified my crawl-urlfilter.txt as follows:
# skip http:, ftp:, https: & mailto: urls
-^(http|ftp|mailto|https):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*apache.org/
# skip everything else
#-. #Changed
# accept anything else
+.*
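
To double-check the filter itself, here is a rough sketch in Python (not Nutch code) of how I understand the rules to be applied. It assumes the rules are tried in file order, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is ignored. The `accepts` helper is a name made up for this illustration, and the one rule that the list archive replaced with a placeholder is left out.

```python
import re

# Rules copied from the crawl-urlfilter.txt above, in file order.
# The rule obscured by the list archive is intentionally omitted.
RULES = [
    ("-", r"^(http|ftp|mailto|https):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r".*"),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is ignored


if __name__ == "__main__":
    for url in ("file:///C:/check/", "http://lucene.apache.org/nutch/"):
        print(url, accepts(url))
```

If this reading of the rules is right, the file: URL passes the filter, which would mean the problem lies somewhere other than crawl-urlfilter.txt.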
-------------------------------------------------------------------------------------------------
In nutch_site.xml I have added the plugin for the file protocol as follows:
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
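
As a quick check of the plugin.includes value, here is a small Python sketch (again, not Nutch code). It assumes a plugin is enabled when its whole id matches the expression; `is_included` is a hypothetical helper and the plugin ids tested are only examples.

```python
import re

# Value of plugin.includes exactly as posted above.
PLUGIN_INCLUDES = (
    r"protocol-file|protocol-http|parse-(text|html)"
    r"|index-basic|query-(basic|site|url)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Assumption: a plugin is enabled when its entire id matches the expression."""
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("protocol-file", "protocol-http", "parse-html", "protocol-ftp"):
        print(plugin_id, is_included(plugin_id))
```

Under that assumption the expression does cover protocol-file, so the value itself looks fine as long as the file it lives in is actually being read.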
-------------------------------------------------------------------------------------------------
My URL list contains:
file:///C:/check/
-------------------------------------------------------------------------------------------------
The error is listed below: protocol not found for url=file
Injector: starting
Injector: crawlDb: localfs/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: starting
Generator: segment: localfs/segments/20070126152212
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: localfs/segments/20070126152212
Fetcher: threads: 10
fetching file:///c:/check/
fetch of file:///c:/check/ failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
Fetcher: done
---------------------------------------------------------------------------------------------------------
Could anyone please help? This is quite urgent. Is there anything else
that needs to be done? Thanks in advance.
Regards,
Abhilash