Hi, I think I've found my problem.
It looks like this line in the "urls" file:

file:///testfiles/ <file:///testfiles/>

should have been just:

file:///testfiles/

I originally had the bad one because I followed the link that I sent, but I don't know why the author included the "<...>" part :(... I'm re-running the crawl now. I don't have the patch/fix to prevent crawling the parent directory yet, so it's taking a while.

Jim

---- [email protected] wrote:
> Hi,
>
> I'm trying to setup a test using Nutch to crawl the local file system. This
> is on a Redhat system. I'm basically following the procedure in these links:
>
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
>
> http://markmail.org/message/pnmqd7ypguh7qtit
>
> Here's my command line:
>
> bin/nutch crawl fs-urls -dir acrawlfs1.test -depth 4 >& acrawlfs1.log
>
> Here's what I get in the Nutch log file:
>
> [r...@nssdemo nutch-1.0]# cat acrawlfs1.log
> crawl started in: acrawlfs1.test
> rootUrlDir = fs-urls
> threads = 10
> depth = 4
> Injector: starting
> Injector: crawlDb: acrawlfs1.test/crawldb
> Injector: urlDir: fs-urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: acrawlfs1.test/segments/20090716101523
> Generator: filtering: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Bad protocol in url:
> Bad protocol in url: #file:///data/readings/semanticweb/
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: acrawlfs1.test/segments/20090716101523
> Fetcher: threads: 10
> QueueFeeder finished: total 1 records.
> fetching file:///testfiles/ <file:///testfiles/>
> -finishing thread FetcherThread, activeThreads=9
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> org.apache.nutch.protocol.file.FileError: File Error: 404
>   at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
>   at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
> fetch of file:///testfiles/ <file:///testfiles/> failed with:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: acrawlfs1.test/crawldb
> CrawlDb update: segments: [acrawlfs1.test/segments/20090716101523]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: acrawlfs1.test/segments/20090716101532
> Generator: filtering: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Bad protocol in url:
> Bad protocol in url: #file:///data/readings/semanticweb/
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting
> LinkDb: linkdb: acrawlfs1.test/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/opt/nutch-1.0/acrawlfs1.test/segments/20090716101523
> LinkDb: done
> Indexer: starting
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: acrawlfs1.test/indexes
> Dedup: done
> merging indexes to: acrawlfs1.test/index
> Adding file:/opt/nutch-1.0/acrawlfs1.test/indexes/part-00000
> done merging
> crawl finished: acrawlfs1.test
>
> Here's my conf/nutch-site.xml:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
> <name>plugin.includes</name>
> <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
> </property>
> <property>
> <name>file.content.limit</name> <value>-1</value>
> </property>
>
> </configuration>
>
> and, my crawl-urlfilter.txt:
>
> [r...@nssdemo nutch-1.0]# cat conf/crawl-urlfilter.txt
> #skip http:, ftp:, & mailto: urls
> ##-^(file|ftp|mailto):
>
> -^(http|ftp|mailto):
>
> #skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
> #skip URLs containing certain characters as probable queries, etc.
> -[...@=]
>
> #accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
> #accecpt anything else
> +.*
>
> And, in fs-urls, I have urls:
>
> file:///testfiles/ <file:///testfiles/>
>
> #file:///data/readings/semanticweb/
>
> For this test, I have a /testfiles directory, with a bunch of .txt files
> under two directories /testfiles/Content1 and /testfiles/Content2.
>
> It looks like the crawl goes to the end, and creates the directories and
> files under acrawlfs1.test, but when I run Luke on the index directory, I got
> an error, with a popup window with just "0" in it.
>
> Is the problem because of that 404 error in the log? If so, why am I getting
> that 404 error?
>
> Thanks,
> Jim
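
P.S. Until I get that parent-directory patch, I'm thinking of keeping the crawl inside the test tree with the regex URL filter alone. This is only a rough sketch of what I'd try in conf/crawl-urlfilter.txt, assuming the seed stays file:///testfiles/; as I understand it, urlfilter-regex applies the rules top down and the first match wins, so a final "-." (in place of my current "+.*" catch-all) should reject anything outside /testfiles/, including the parent directories:

#skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

#skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

#accept only file: URLs under the test tree
+^file:///testfiles/

#reject everything else (replaces the "+.*" catch-all)
-.

And the seed file under fs-urls would then have just the single corrected line, with no "<...>":

file:///testfiles/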
