Hi,

Thanks for your help with this. I was sent an email saying that this is fixed by a new version of the jcifs implementation:
https://issues.apache.org/jira/browse/NUTCH-427

Give it a go and let me know if it works ;)

Vadim B wrote:
>
> Hi,
>
> I am working on the same issue as you. So far I can crawl file:///C:/*
> but I am stuck on the smb part. It looks to me as though this plugin isn't
> working properly, so it needs to be fixed for the newer version of Nutch.
>
> The error I get differs a bit from yours:
>
> 2007-05-25 18:06:29,573 INFO fetcher.Fetcher - fetching
> smb://mobidick/test/
> 2007-05-25 18:06:29,573 INFO fetcher.Fetcher - fetch of
> smb://mobidick/test/ failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb
>
> I will dive into plugin-smb and try to narrow down the problem. Maybe we
> can work together to get a quick solution.
>
> ---SNIP---
>
> # accept hosts in MY.DOMAIN.NAME
> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^file:///C:/Policies/ <<-- why did you put it here? It doesn't make sense,
> because the +^(file|smb) line above already matches, so this will be
> skipped
> ---SNIP---
>
> ---SNIP---
> 2007-05-24 14:04:22,000 WARN crawl.PartitionUrlByHost - Malformed URL:
> 'smb://sql1/Sales/DATA/'
> // did you quote the URL, or is it displayed in the logs like this? I don't
> get this error
> ---SNIP---
>
> Try this in org.apache.nutch.crawl.Crawl:
>
> public static void main(String args[]) throws Exception {
>   System.setProperty("java.protocol.handler.pkgs", "jcifs"); // new
>   LOG.info("SMB Info: " +
>       System.getProperty("java.protocol.handler.pkgs")); // new
>   LOG.info("SMB Info: " + new
>       java.util.PropertyPermission("java.protocol.handler.pkgs",
>       "read, write").toString()); // new
>   if (args.length < 1) {
>     System.out.println
>       ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
>     return;
>   }
> ---SNIP---
>
> Check out this:
> http://java.sun.com/developer/onlineTraining/protocolhandlers/
>
> opoole wrote:
>>
>> Hi All, I hope you can help, as I am becoming rather depressed with Nutch
>> on Windows.
>>
>> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
>> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
>>
>> I cannot stop Nutch from crawling parent directories. I have looked at
>> other threads, and none of the suggestions seem to work.
>>
>> I have tried to include protocol-smb [jcifs], but Cygwin keeps prompting
>> for Java syntax corrections.
>>
>> Below I have listed my configurations along with the command I type in
>> cygwin for jcifs:
>>
>> CRAWL-URLFILTER
>> # The url filter file used by the crawl command.
>>
>> # Better for intranet crawling.
>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>>
>> # Each non-comment, non-blank line contains a regular expression
>> # prefixed by '+' or '-'. The first matching pattern in the file
>> # determines whether a URL is included or ignored. If no pattern
>> # matches, the URL is ignored.
>>
>> # skip file:, ftp:, & mailto: urls
>> -^(http|ftp|mailto):
>> +^(file|smb):
>>
>> # skip image and other suffixes we can't yet parse
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>
>> # skip URLs containing certain characters as probable queries, etc.
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>
>> # accept hosts in MY.DOMAIN.NAME
>> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>> +^file:///C:/Policies/ <<-- why did you put it here? It doesn't make sense,
>> because the +^(file|smb) line already matches!
>>
>> # skip everything else
>> -.
>>
>> NUTCH-SITE
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <nutch-conf>
>>
>> <property>
>> <name>http.agent.name</name>
>> <value>pascall</value>
>> <description></description>
>> </property>
>>
>> <property>
>> <name>file.content.limit</name>
>> <value>-1</value>
>> <description>The length limit for downloaded content, in bytes.
>> If this value is nonnegative (>=0), content longer than it will be
>> truncated;
>> otherwise, no truncation at all.
>> </description>
>> </property>
>>
>> <property>
>> <name>file.crawl.parent</name>
>> <value>false</value>
>> <description>The crawler is not restricted to the directories that you
>> specified in the Urls file; it jumps into the parent directories as well.
>> For your own crawls you can change this behavior (set to false) so that
>> only directories beneath the directories that you specify get
>> crawled.</description>
>> </property>
>>
>> <property>
>> <name>plugin.includes</name>
>> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
>> </property>
>>
>> </nutch-conf>
>>
>> CYGWIN
>>
>> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
>>
>> java -Djava.protocol.handler.pkgs=jcifs
>>
>> When I press return the cygwin shell displays a list of java commands as
>> though I am using incorrect syntax.
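
On the crawl-urlfilter rules quoted above: the file's own comments say the first matching pattern decides, so a broad `+^(file|smb):` line placed early wins before the narrower `+^file:///C:/Policies/` line is ever reached. Here is a minimal sketch of that first-match behaviour. `FirstMatchFilter` is a hypothetical class written for this mail, not Nutch's own filter code, and the "tightened" rule order in `main` is only an assumption about what might keep the crawl out of parent directories.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical first-match URL filter; a sketch, not Nutch's implementation. */
public class FirstMatchFilter {
    private static final class Rule {
        final boolean accept;   // true for '+', false for '-'
        final Pattern pattern;
        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Each line is '+' or '-' followed by a regex; blank and '#' lines are skipped. */
    public FirstMatchFilter(String... lines) {
        for (String line : lines) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("#")) continue;
            rules.add(new Rule(t.charAt(0) == '+', t.substring(1)));
        }
    }

    /** Index of the first rule whose regex is found in the URL, or -1 if none. */
    public int firstMatch(String url) {
        for (int i = 0; i < rules.size(); i++) {
            if (rules.get(i).pattern.matcher(url).find()) return i;
        }
        return -1;
    }

    /** A URL is accepted only if its first matching rule is a '+' rule. */
    public boolean accepts(String url) {
        int i = firstMatch(url);
        return i >= 0 && rules.get(i).accept;
    }

    public static void main(String[] args) {
        // Rule order as posted: the broad file|smb rule comes first.
        FirstMatchFilter posted = new FirstMatchFilter(
            "-^(http|ftp|mailto):", "+^(file|smb):", "+^file:///C:/Policies/", "-.");
        // Assumed tighter order: only the Policies folder and smb are accepted.
        FirstMatchFilter tightened = new FirstMatchFilter(
            "-^(http|ftp|mailto):", "+^file:///C:/Policies/", "+^smb:", "-.");

        String parent = "file:///C:/";
        System.out.println("posted accepts parent:    " + posted.accepts(parent));
        System.out.println("tightened accepts parent: " + tightened.accepts(parent));
    }
}
```

With the posted order, `file:///C:/` is matched by the second rule and accepted, which would fit the parent directories still being crawled. I have not checked this against a real Nutch run.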
>> >> Dump of Crawl from Cygwin: >> >> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file: >> top-level element not <configuration> >> 2007-05-24 14:04:16,171 INFO crawl.Crawl - crawl started in: crawl >> 2007-05-24 14:04:16,171 INFO crawl.Crawl - rootUrlDir = urls.txt >> 2007-05-24 14:04:16,171 INFO crawl.Crawl - threads = 10 >> 2007-05-24 14:04:16,171 INFO crawl.Crawl - depth = 5 >> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file: >> top-level element not <configuration> >> 2007-05-24 14:04:16,281 INFO crawl.Injector - Injector: starting >> 2007-05-24 14:04:16,281 INFO crawl.Injector - Injector: crawlDb: >> crawl/crawldb >> 2007-05-24 14:04:16,296 INFO crawl.Injector - Injector: urlDir: urls.txt >> 2007-05-24 14:04:16,296 INFO crawl.Injector - Injector: Converting >> injected urls to crawl db entries. >> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file: >> top-level element not <configuration> >> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file: >> top-level element not <configuration> >> 2007-05-24 14:04:16,953 INFO plugin.PluginRepository - Plugins: looking >> in: C:\nutch-0.9\plugins >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Plugin >> Auto-activation mode: [true] >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Registered >> Plugins: >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - the nutch core >> extension points (nutch-extensionpoints) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSPowerPoint >> Parse Plug-in (parse-mspowerpoint) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Basic Query >> Filter (query-basic) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Basic Indexing >> Filter (index-basic) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Html Parse >> Plug-in (parse-html) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Pdf Parse >> Plug-in (parse-pdf) >> 2007-05-24 14:04:17,156 INFO 
plugin.PluginRepository - Site Query >> Filter (query-site) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Jakarta POI - >> Java API To Access Microsoft Format Files (lib-jakarta-poi) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Text Parse >> Plug-in (parse-text) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSWord Parse >> Plug-in (parse-msword) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - SMB Protocol >> Plug-in (protocol-smb) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSExcel Parse >> Plug-in (parse-msexcel) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - OPIC Scoring >> Plug-in (scoring-opic) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - CyberNeko HTML >> Parser (lib-nekohtml) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Log4j >> (lib-log4j) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - File Protocol >> Plug-in (protocol-file) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - URL Query Filter >> (query-url) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Parse MS >> Documents Framework (lib-parsems) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Registered >> Extension-Points: >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Summarizer >> (org.apache.nutch.searcher.Summarizer) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch URL >> Normalizer (org.apache.nutch.net.URLNormalizer) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Protocol >> (org.apache.nutch.protocol.Protocol) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Analysis >> (org.apache.nutch.analysis.NutchAnalyzer) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch URL Filter >> (org.apache.nutch.net.URLFilter) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Indexing >> Filter (org.apache.nutch.indexer.IndexingFilter) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - 
Nutch Online >> Search Results Clustering Plugin >> (org.apache.nutch.clustering.OnlineClusterer) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - HTML Parse >> Filter (org.apache.nutch.parse.HtmlParseFilter) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Content >> Parser (org.apache.nutch.parse.Parser) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Scoring >> (org.apache.nutch.scoring.ScoringFilter) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Query >> Filter (org.apache.nutch.searcher.QueryFilter) >> 2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Ontology Model >> Loader (org.apache.nutch.ontology.Ontology) >> 2007-05-24 14:04:17,875 INFO crawl.Injector - Injector: Merging injected >> urls into crawl db. >> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file: >> top-level element not <configuration> >> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file: >> top-level element not <configuration> >> 2007-05-24 14:04:18,375 WARN util.NativeCodeLoader - Unable to load >> native-hadoop library for your platform... using builtin-java classes >> where applicable >> 2007-05-24 14:04:19,281 INFO crawl.Injector - Injector: done >> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: Selecting >> best-scoring urls due for fetch. >> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: starting >> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: segment: >> crawl/segments/20070524140420 >> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: filtering: >> false >> 2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: topN: >> 2147483647 >> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file: >> top-level element not <configuration> >> 2007-05-24 14:04:20,312 INFO crawl.Generator - Generator: jobtracker is >> 'local', generating exactly one partition. 
>> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file: >> top-level element not <configuration> >> 2007-05-24 14:04:20,609 INFO plugin.PluginRepository - Plugins: looking >> in: C:\nutch-0.9\plugins >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Plugin >> Auto-activation mode: [true] >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Registered >> Plugins: >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - the nutch core >> extension points (nutch-extensionpoints) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSPowerPoint >> Parse Plug-in (parse-mspowerpoint) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Basic Query >> Filter (query-basic) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Basic Indexing >> Filter (index-basic) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Html Parse >> Plug-in (parse-html) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Pdf Parse >> Plug-in (parse-pdf) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Site Query >> Filter (query-site) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Jakarta POI - >> Java API To Access Microsoft Format Files (lib-jakarta-poi) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Text Parse >> Plug-in (parse-text) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSWord Parse >> Plug-in (parse-msword) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - SMB Protocol >> Plug-in (protocol-smb) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - MSExcel Parse >> Plug-in (parse-msexcel) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - OPIC Scoring >> Plug-in (scoring-opic) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - CyberNeko HTML >> Parser (lib-nekohtml) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Log4j >> (lib-log4j) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - File Protocol >> Plug-in (protocol-file) >> 2007-05-24 
14:04:20,781 INFO plugin.PluginRepository - URL Query Filter >> (query-url) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Parse MS >> Documents Framework (lib-parsems) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Registered >> Extension-Points: >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Summarizer >> (org.apache.nutch.searcher.Summarizer) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch URL >> Normalizer (org.apache.nutch.net.URLNormalizer) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Protocol >> (org.apache.nutch.protocol.Protocol) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Analysis >> (org.apache.nutch.analysis.NutchAnalyzer) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch URL Filter >> (org.apache.nutch.net.URLFilter) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Indexing >> Filter (org.apache.nutch.indexer.IndexingFilter) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Online >> Search Results Clustering Plugin >> (org.apache.nutch.clustering.OnlineClusterer) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - HTML Parse >> Filter (org.apache.nutch.parse.HtmlParseFilter) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Content >> Parser (org.apache.nutch.parse.Parser) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Scoring >> (org.apache.nutch.scoring.ScoringFilter) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Nutch Query >> Filter (org.apache.nutch.searcher.QueryFilter) >> 2007-05-24 14:04:20,781 INFO plugin.PluginRepository - Ontology Model >> Loader (org.apache.nutch.ontology.Ontology) >> 2007-05-24 14:04:20,796 WARN crawl.PartitionUrlByHost - Malformed URL: >> 'smb://sql1/Sales/DATA/' >> 2007-05-24 14:04:20,843 INFO plugin.PluginRepository - Plugins: looking >> in: C:\nutch-0.9\plugins >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Plugin >> 
Auto-activation mode: [true] >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Registered >> Plugins: >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - the nutch core >> extension points (nutch-extensionpoints) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSPowerPoint >> Parse Plug-in (parse-mspowerpoint) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Basic Query >> Filter (query-basic) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Basic Indexing >> Filter (index-basic) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Html Parse >> Plug-in (parse-html) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Pdf Parse >> Plug-in (parse-pdf) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Site Query >> Filter (query-site) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Jakarta POI - >> Java API To Access Microsoft Format Files (lib-jakarta-poi) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Text Parse >> Plug-in (parse-text) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSWord Parse >> Plug-in (parse-msword) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - SMB Protocol >> Plug-in (protocol-smb) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - MSExcel Parse >> Plug-in (parse-msexcel) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - OPIC Scoring >> Plug-in (scoring-opic) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - CyberNeko HTML >> Parser (lib-nekohtml) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Log4j >> (lib-log4j) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - File Protocol >> Plug-in (protocol-file) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - URL Query Filter >> (query-url) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Parse MS >> Documents Framework (lib-parsems) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Registered >> Extension-Points: >> 2007-05-24 
14:04:21,000 INFO plugin.PluginRepository - Nutch Summarizer >> (org.apache.nutch.searcher.Summarizer) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch URL >> Normalizer (org.apache.nutch.net.URLNormalizer) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Protocol >> (org.apache.nutch.protocol.Protocol) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Analysis >> (org.apache.nutch.analysis.NutchAnalyzer) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch URL Filter >> (org.apache.nutch.net.URLFilter) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Indexing >> Filter (org.apache.nutch.indexer.IndexingFilter) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Online >> Search Results Clustering Plugin >> (org.apache.nutch.clustering.OnlineClusterer) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - HTML Parse >> Filter (org.apache.nutch.parse.HtmlParseFilter) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Content >> Parser (org.apache.nutch.parse.Parser) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Scoring >> (org.apache.nutch.scoring.ScoringFilter) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Nutch Query >> Filter (org.apache.nutch.searcher.QueryFilter) >> 2007-05-24 14:04:21,000 INFO plugin.PluginRepository - Ontology Model >> Loader (org.apache.nutch.ontology.Ontology) >> 2007-05-24 14:04:21,578 INFO crawl.Generator - Generator: Partitioning >> selected urls by host, for politeness. 
>> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file: >> top-level element not <configuration> >> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file: >> top-level element not <configuration> >> 2007-05-24 14:04:21,859 INFO plugin.PluginRepository - Plugins: looking >> in: C:\nutch-0.9\plugins >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Plugin >> Auto-activation mode: [true] >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Registered >> Plugins: >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - the nutch core >> extension points (nutch-extensionpoints) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSPowerPoint >> Parse Plug-in (parse-mspowerpoint) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Basic Query >> Filter (query-basic) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Basic Indexing >> Filter (index-basic) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Html Parse >> Plug-in (parse-html) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Pdf Parse >> Plug-in (parse-pdf) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Site Query >> Filter (query-site) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Jakarta POI - >> Java API To Access Microsoft Format Files (lib-jakarta-poi) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Text Parse >> Plug-in (parse-text) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSWord Parse >> Plug-in (parse-msword) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - SMB Protocol >> Plug-in (protocol-smb) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - MSExcel Parse >> Plug-in (parse-msexcel) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - OPIC Scoring >> Plug-in (scoring-opic) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - CyberNeko HTML >> Parser (lib-nekohtml) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Log4j >> (lib-log4j) >> 
2007-05-24 14:04:22,000 INFO plugin.PluginRepository - File Protocol >> Plug-in (protocol-file) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - URL Query Filter >> (query-url) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Parse MS >> Documents Framework (lib-parsems) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Registered >> Extension-Points: >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Summarizer >> (org.apache.nutch.searcher.Summarizer) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch URL >> Normalizer (org.apache.nutch.net.URLNormalizer) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Protocol >> (org.apache.nutch.protocol.Protocol) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Analysis >> (org.apache.nutch.analysis.NutchAnalyzer) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch URL Filter >> (org.apache.nutch.net.URLFilter) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Indexing >> Filter (org.apache.nutch.indexer.IndexingFilter) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Online >> Search Results Clustering Plugin >> (org.apache.nutch.clustering.OnlineClusterer) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - HTML Parse >> Filter (org.apache.nutch.parse.HtmlParseFilter) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Content >> Parser (org.apache.nutch.parse.Parser) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Scoring >> (org.apache.nutch.scoring.ScoringFilter) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Nutch Query >> Filter (org.apache.nutch.searcher.QueryFilter) >> 2007-05-24 14:04:22,000 INFO plugin.PluginRepository - Ontology Model >> Loader (org.apache.nutch.ontology.Ontology) >> 2007-05-24 14:04:22,000 WARN crawl.PartitionUrlByHost - Malformed URL: >> 'smb://sql1/Sales/DATA/' >> 2007-05-24 14:04:22,843 INFO crawl.Generator - Generator: done. 
>> 2007-05-24 14:04:22,843 INFO fetcher.Fetcher - Fetcher: starting >> 2007-05-24 14:04:22,843 INFO fetcher.Fetcher - Fetcher: segment: >> crawl/segments/20070524140420 >> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file: >> top-level element not <configuration> >> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file: >> top-level element not <configuration> >> 2007-05-24 14:04:23,187 INFO fetcher.Fetcher - Fetcher: threads: 10 >> 2007-05-24 14:04:23,203 INFO plugin.PluginRepository - Plugins: looking >> in: C:\nutch-0.9\plugins >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Plugin >> Auto-activation mode: [true] >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Registered >> Plugins: >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - the nutch core >> extension points (nutch-extensionpoints) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSPowerPoint >> Parse Plug-in (parse-mspowerpoint) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Basic Query >> Filter (query-basic) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Basic Indexing >> Filter (index-basic) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Html Parse >> Plug-in (parse-html) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Pdf Parse >> Plug-in (parse-pdf) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Site Query >> Filter (query-site) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Jakarta POI - >> Java API To Access Microsoft Format Files (lib-jakarta-poi) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Text Parse >> Plug-in (parse-text) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSWord Parse >> Plug-in (parse-msword) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - SMB Protocol >> Plug-in (protocol-smb) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - MSExcel Parse >> Plug-in (parse-msexcel) >> 2007-05-24 14:04:23,343 INFO 
plugin.PluginRepository - OPIC Scoring >> Plug-in (scoring-opic) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - CyberNeko HTML >> Parser (lib-nekohtml) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Log4j >> (lib-log4j) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - File Protocol >> Plug-in (protocol-file) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - URL Query Filter >> (query-url) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Parse MS >> Documents Framework (lib-parsems) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Registered >> Extension-Points: >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Summarizer >> (org.apache.nutch.searcher.Summarizer) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch URL >> Normalizer (org.apache.nutch.net.URLNormalizer) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Protocol >> (org.apache.nutch.protocol.Protocol) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Analysis >> (org.apache.nutch.analysis.NutchAnalyzer) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch URL Filter >> (org.apache.nutch.net.URLFilter) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Indexing >> Filter (org.apache.nutch.indexer.IndexingFilter) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Online >> Search Results Clustering Plugin >> (org.apache.nutch.clustering.OnlineClusterer) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - HTML Parse >> Filter (org.apache.nutch.parse.HtmlParseFilter) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Content >> Parser (org.apache.nutch.parse.Parser) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Scoring >> (org.apache.nutch.scoring.ScoringFilter) >> 2007-05-24 14:04:23,343 INFO plugin.PluginRepository - Nutch Query >> Filter (org.apache.nutch.searcher.QueryFilter) >> 2007-05-24 14:04:23,343 INFO 
plugin.PluginRepository - Ontology Model >> Loader (org.apache.nutch.ontology.Ontology) >> 2007-05-24 14:04:23,390 INFO fetcher.Fetcher - fetching >> smb://sql1/Sales/DATA/ >> 2007-05-24 14:04:23,390 INFO fetcher.Fetcher - fetch of >> smb://sql1/Sales/DATA/ failed with: >> org.apache.nutch.protocol.ProtocolNotFound: >> java.net.MalformedURLException: unknown protocol: smb >> 2007-05-24 14:04:23,500 INFO fetcher.Fetcher - fetching >> file:///C:/Policies/ >> 2007-05-24 14:04:23,718 INFO crawl.SignatureFactory - Using Signature >> impl: org.apache.nutch.crawl.MD5Signature >> 2007-05-24 14:04:24,671 INFO plugin.PluginRepository - Plugins: looking >> in: C:\nutch-0.9\plugins >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Plugin >> Auto-activation mode: [true] >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Registered >> Plugins: >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - the nutch core >> extension points (nutch-extensionpoints) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - MSPowerPoint >> Parse Plug-in (parse-mspowerpoint) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Basic Query >> Filter (query-basic) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Basic Indexing >> Filter (index-basic) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Html Parse >> Plug-in (parse-html) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Pdf Parse >> Plug-in (parse-pdf) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Site Query >> Filter (query-site) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Jakarta POI - >> Java API To Access Microsoft Format Files (lib-jakarta-poi) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Text Parse >> Plug-in (parse-text) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - MSWord Parse >> Plug-in (parse-msword) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - SMB Protocol >> Plug-in (protocol-smb) >> 2007-05-24 14:04:24,812 
INFO plugin.PluginRepository - MSExcel Parse >> Plug-in (parse-msexcel) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - OPIC Scoring >> Plug-in (scoring-opic) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - CyberNeko HTML >> Parser (lib-nekohtml) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Log4j >> (lib-log4j) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - File Protocol >> Plug-in (protocol-file) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - URL Query Filter >> (query-url) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Parse MS >> Documents Framework (lib-parsems) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Registered >> Extension-Points: >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Summarizer >> (org.apache.nutch.searcher.Summarizer) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch URL >> Normalizer (org.apache.nutch.net.URLNormalizer) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Protocol >> (org.apache.nutch.protocol.Protocol) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Analysis >> (org.apache.nutch.analysis.NutchAnalyzer) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch URL Filter >> (org.apache.nutch.net.URLFilter) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Indexing >> Filter (org.apache.nutch.indexer.IndexingFilter) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Online >> Search Results Clustering Plugin >> (org.apache.nutch.clustering.OnlineClusterer) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - HTML Parse >> Filter (org.apache.nutch.parse.HtmlParseFilter) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Content >> Parser (org.apache.nutch.parse.Parser) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Nutch Scoring >> (org.apache.nutch.scoring.ScoringFilter) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - 
Nutch Query >> Filter (org.apache.nutch.searcher.QueryFilter) >> 2007-05-24 14:04:24,812 INFO plugin.PluginRepository - Ontology Model >> Loader (org.apache.nutch.ontology.Ontology) >> 2007-05-24 14:04:25,171 INFO fetcher.Fetcher - Fetcher: done >> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: starting >> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: db: >> crawl/crawldb >> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: segments: >> [crawl/segments/20070524140420] >> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: additions >> allowed: true >> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: URL >> normalizing: true >> 2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: URL >> filtering: true >> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file: >> top-level element not <configuration> >> 2007-05-24 14:04:25,203 INFO crawl.CrawlDb - CrawlDb update: Merging >> segment data into db. >> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file: >> top-level element not <configuration> >> 2007-05-24 14:04:25,468 INFO plugin.PluginRepository - Plugins: looking >> in: C:\nutch-0.9\plugins >> 2007-05-24 14:04:25,593 INFO plugin.PluginRepository - Plugin >> Auto-activation mode: [true] >> >> >> Thank you for reading my post, hope you can help. >> >> Regards, >> >> Oli >> > > -- View this message in context: http://www.nabble.com/WIN-XP-PRO--Djava.protocol*-file%3A---c%3A-folder--Crawling-Parents-tf3809966.html#a10851108 Sent from the Nutch - User mailing list archive at Nabble.com.
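
On the `java.net.MalformedURLException: unknown protocol: smb` line in the log above: `java.net.URL` rejects any scheme it has no stream handler for. With `java.protocol.handler.pkgs=jcifs` the JVM looks for a class named `jcifs.smb.Handler`, so the jcifs jar has to be on the classpath of the JVM that actually runs Nutch. Typing `java -D...` on its own, with no main class, only prints the usage text. The sketch below shows the mechanism without needing jcifs: `StubSmbHandler` is a hypothetical stand-in that never opens a connection, and it is registered through a handler factory rather than the system property, purely for illustration.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class SmbUrlCheck {
    /** Returns true if java.net.URL can find a handler for the URL's scheme. */
    public static boolean canParse(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    /** Hypothetical stand-in for a real SMB handler; it cannot fetch anything. */
    static final class StubSmbHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("stub handler: no real SMB support");
        }
    }

    /** Registers the stub for "smb" only. May be called once per JVM. */
    public static void registerStub() {
        // Returning null for other schemes lets the JDK fall back to its built-in handlers.
        URL.setURLStreamHandlerFactory(
            protocol -> "smb".equals(protocol) ? new StubSmbHandler() : null);
    }

    public static void main(String[] args) {
        String url = "smb://sql1/Sales/DATA/";
        System.out.println("before registering a handler: " + canParse(url));
        registerStub();
        System.out.println("after registering a handler:  " + canParse(url));
    }
}
```

On a stock JVM without jcifs, the first check should fail and the second should succeed, which matches the idea that the crawl fails because no `smb` handler is visible when the URL is parsed. Whether the property alone is enough inside Nutch's plugin class loading is something I have not verified.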

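
On the repeated `bad conf file: top-level element not <configuration>` lines in the log above: the nutch-site.xml I posted uses `<nutch-conf>` as its top-level element, which looks like what that message is complaining about. A small sketch for checking the root element of a config file before starting a crawl follows. `ConfRootCheck` is a hypothetical helper, and the XML strings in `main` are a cut-down version of the posted file, not the full thing.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ConfRootCheck {
    /** Returns the tag name of the top-level element of the given XML text. */
    public static String rootElement(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Cut-down copy of the posted nutch-site.xml (one property only).
        String posted = "<?xml version=\"1.0\"?>\n"
            + "<nutch-conf>\n"
            + "<property><name>file.crawl.parent</name><value>false</value></property>\n"
            + "</nutch-conf>";
        // Same content with the root element renamed to what the log asks for.
        String renamed = posted.replace("nutch-conf>", "configuration>");

        System.out.println("posted root:  " + rootElement(posted));
        System.out.println("renamed root: " + rootElement(renamed));
    }
}
```

Renaming the root element should at least silence the FATAL message. I do not know whether the properties in the file, such as `file.crawl.parent`, were being applied or ignored while the message was being logged.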