Hi all, I hope you can help, as I am becoming rather depressed with Nutch on
Windows.
Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from the Cygwin
site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
I cannot stop Nutch from crawling parent directories. I have looked at other
threads, but none of the suggestions seem to work.
I have tried to include protocol-smb [jcifs], but Cygwin keeps prompting for
Java syntax corrections.
Below I have listed my configurations, along with the command I type in
Cygwin for jcifs:
CRAWL-URLFILTER
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip http:, ftp:, & mailto: urls; accept file: and smb: urls
-^(http|ftp|mailto):
+^(file|smb):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept hosts in MY.DOMAIN.NAME
# Standard: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^file:///C:/Policies/
# skip everything else
-.
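The first-match rule described in the comments above can be sketched as follows. This is only an illustration in Python, not Nutch's own code: it assumes the filter tests each pattern with a find-style search, the rule list is copied from the file above, `first_match` is a hypothetical helper, and `handbook.doc` is a made-up file name.

```python
import re

# Rules copied from the crawl-urlfilter listing above, in file order.
# Each entry is (sign, pattern): '+' accepts the URL, '-' ignores it.
RULES = [
    ("-", r"^(http|ftp|mailto):"),
    ("+", r"^(file|smb):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^file:///C:/Policies/"),
    ("-", r"."),
]


def first_match(url):
    """Return (sign, pattern) for the first rule that matches, else None."""
    for sign, pattern in RULES:
        # Assumption: a pattern matches if it is found anywhere in the URL.
        if re.search(pattern, url):
            return sign, pattern
    return None


if __name__ == "__main__":
    for url in (
        "file:///C:/",
        "file:///C:/Policies/handbook.doc",  # hypothetical file name
        "smb://sql1/Sales/DATA/",
    ):
        print(url, first_match(url))
```

Under these assumptions every file: URL, including the parent `file:///C:/`, is decided by the `+^(file|smb):` rule before the `+^file:///C:/Policies/` rule is ever reached, which may be part of why parent directories are still accepted.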
NUTCH-SITE
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<nutch-conf>
<property>
<name>http.agent.name</name>
<value>pascall</value>
<description></description>
</property>
<property>
<name>file.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be
truncated;
otherwise, no truncation at all.
</description>
</property>
<property>
<name>file.crawl.parent</name>
<value>false</value>
<description>The crawler is not restricted to the directories that you
specified in the urls file; it also jumps into the parent directories.
For your own crawls you can change this behavior (set to false) so that
only directories beneath the directories that you specify get
crawled.</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
</nutch-conf>
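The crawl log below repeatedly reports "bad conf file: top-level element not <configuration>", which suggests that the loader expects `<configuration>` rather than `<nutch-conf>` as the root element, and that the overrides in this file (including file.crawl.parent) may not be applied at all. A minimal sketch of how to check this, in Python rather than with any Nutch tool: `SITE_XML` is a shortened copy of the file above, and `root_tag` and `with_root` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the nutch-site listing above (one property kept).
SITE_XML = """<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>file.crawl.parent</name>
    <value>false</value>
  </property>
</nutch-conf>
"""


def root_tag(xml_text):
    """Return the name of the top-level element of an XML document."""
    return ET.fromstring(xml_text).tag


def with_root(xml_text, new_tag):
    """Return the same document with its top-level element renamed."""
    root = ET.fromstring(xml_text)
    root.tag = new_tag
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print("current root:", root_tag(SITE_XML))
    # Assumption taken from the log message: the expected root is 'configuration'.
    fixed = with_root(SITE_XML, "configuration")
    print("renamed root:", root_tag(fixed))
```

Renaming the root element by hand in nutch-site.xml would have the same effect as `with_root` here. Whether that alone stops the parent-directory crawl is not confirmed by anything in this post.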
CYGWIN
Using Cygwin, I enter the following command from C:\Sun\Java\jdk160\bin\
java -Djava.protocol.handler.pkgs=jcifs
When I press return, the Cygwin shell displays a list of java options, as
though I were using incorrect syntax.
Dump of Crawl from Cygwin:
2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:16,171 INFO crawl.Crawl - crawl started in: crawl
2007-05-24 14:04:16,171 INFO crawl.Crawl - rootUrlDir = urls.txt
2007-05-24 14:04:16,171 INFO crawl.Crawl - threads = 10
2007-05-24 14:04:16,171 INFO crawl.Crawl - depth = 5
2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:16,281 INFO crawl.Injector - Injector: starting
2007-05-24 14:04:16,281 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2007-05-24 14:04:16,296 INFO crawl.Injector - Injector: urlDir: urls.txt
2007-05-24 14:04:16,296 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:16,953 INFO plugin.PluginRepository - Plugins: looking in: C:\nutch-0.9\plugins
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Registered Plugins:
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Site Query Filter (query-site)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - SMB Protocol Plug-in (protocol-smb)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Log4j (lib-log4j)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - File Protocol Plug-in (protocol-file)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - URL Query Filter (query-url)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Registered Extension-Points:
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-05-24 14:04:17,156 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-05-24 14:04:17,875 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:18,375 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2007-05-24 14:04:19,281 INFO crawl.Injector - Injector: done
2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: starting
2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: segment: crawl/segments/20070524140420
2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: filtering: false
2007-05-24 14:04:20,281 INFO crawl.Generator - Generator: topN: 2147483647
2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:20,312 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:20,609 INFO plugin.PluginRepository - Plugins: looking in: C:\nutch-0.9\plugins
[... 14:04:20,781: same plugin and extension-point listing as the first one above ...]
2007-05-24 14:04:20,796 WARN crawl.PartitionUrlByHost - Malformed URL: 'smb://sql1/Sales/DATA/'
2007-05-24 14:04:20,843 INFO plugin.PluginRepository - Plugins: looking in: C:\nutch-0.9\plugins
[... 14:04:21,000: same plugin and extension-point listing as the first one above ...]
2007-05-24 14:04:21,578 INFO crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:21,859 INFO plugin.PluginRepository - Plugins: looking in: C:\nutch-0.9\plugins
[... 14:04:22,000: same plugin and extension-point listing as the first one above ...]
2007-05-24 14:04:22,000 WARN crawl.PartitionUrlByHost - Malformed URL: 'smb://sql1/Sales/DATA/'
2007-05-24 14:04:22,843 INFO crawl.Generator - Generator: done.
2007-05-24 14:04:22,843 INFO fetcher.Fetcher - Fetcher: starting
2007-05-24 14:04:22,843 INFO fetcher.Fetcher - Fetcher: segment: crawl/segments/20070524140420
2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:23,187 INFO fetcher.Fetcher - Fetcher: threads: 10
2007-05-24 14:04:23,203 INFO plugin.PluginRepository - Plugins: looking in: C:\nutch-0.9\plugins
[... 14:04:23,343: same plugin and extension-point listing as the first one above ...]
2007-05-24 14:04:23,390 INFO fetcher.Fetcher - fetching smb://sql1/Sales/DATA/
2007-05-24 14:04:23,390 INFO fetcher.Fetcher - fetch of smb://sql1/Sales/DATA/ failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: unknown protocol: smb
2007-05-24 14:04:23,500 INFO fetcher.Fetcher - fetching file:///C:/Policies/
2007-05-24 14:04:23,718 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2007-05-24 14:04:24,671 INFO plugin.PluginRepository - Plugins: looking in: C:\nutch-0.9\plugins
[... 14:04:24,812: same plugin and extension-point listing as the first one above ...]
2007-05-24 14:04:25,171 INFO fetcher.Fetcher - Fetcher: done
2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: starting
2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: db: crawl/crawldb
2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: segments: [crawl/segments/20070524140420]
2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
2007-05-24 14:04:25,171 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:25,203 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file: top-level element not <configuration>
2007-05-24 14:04:25,468 INFO plugin.PluginRepository - Plugins: looking in: C:\nutch-0.9\plugins
2007-05-24 14:04:25,593 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
Thank you for reading my post. I hope you can help.
Regards,
Oli
--
View this message in context:
http://www.nabble.com/WIN-XP-PRO--Djava.protocol*-file%3A---c%3A-folder--Crawling-Parents-tf3809966.html#a10783382
Sent from the Nutch - User mailing list archive at Nabble.com.