Hi, I think I've found my problem.
It looks like this line in the "urls" file:

file:///testfiles/ <file:///testfiles/>

should have been just:

file:///testfiles/

I originally had the bad one because I followed the link that I sent, but I don't know why the author included the "<...>" part :(... I'm re-running the crawl now. I don't have the patch/fix to prevent crawling the parent directory yet, so it's taking a while.

Jim

---- [email protected] wrote:
> Hi,
>
> I'm trying to setup a test using Nutch to crawl the local file system. This
> is on a Redhat system. I'm basically following the procedure in these links:
>
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
>
> http://markmail.org/message/pnmqd7ypguh7qtit
>
> Here's my command line:
>
> bin/nutch crawl fs-urls -dir acrawlfs1.test -depth 4 >& acrawlfs1.log
>
> Here's what I get in the Nutch log file:
>
> [r...@nssdemo nutch-1.0]# cat acrawlfs1.log
> crawl started in: acrawlfs1.test
> rootUrlDir = fs-urls
> threads = 10
> depth = 4
> Injector: starting
> Injector: crawlDb: acrawlfs1.test/crawldb
> Injector: urlDir: fs-urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: acrawlfs1.test/segments/20090716101523
> Generator: filtering: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Bad protocol in url:
> Bad protocol in url: #file:///data/readings/semanticweb/
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: acrawlfs1.test/segments/20090716101523
> Fetcher: threads: 10
> QueueFeeder finished: total 1 records.
> fetching file:///testfiles/ <file:///testfiles/>
> -finishing thread FetcherThread, activeThreads=9
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> org.apache.nutch.protocol.file.FileError: File Error: 404
>   at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
>   at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
> fetch of file:///testfiles/ <file:///testfiles/> failed with:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: acrawlfs1.test/crawldb
> CrawlDb update: segments: [acrawlfs1.test/segments/20090716101523]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: acrawlfs1.test/segments/20090716101532
> Generator: filtering: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Bad protocol in url:
> Bad protocol in url: #file:///data/readings/semanticweb/
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting
> LinkDb: linkdb: acrawlfs1.test/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/opt/nutch-1.0/acrawlfs1.test/segments/20090716101523
> LinkDb: done
> Indexer: starting
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: acrawlfs1.test/indexes
> Dedup: done
> merging indexes to: acrawlfs1.test/index
> Adding file:/opt/nutch-1.0/acrawlfs1.test/indexes/part-00000
> done merging
> crawl finished: acrawlfs1.test
>
> Here's my conf/nutch-site.xml:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
> <name>plugin.includes</name>
> <value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
> </property>
> <property>
> <name>file.content.limit</name> <value>-1</value>
> </property>
>
> </configuration>
>
> and, my crawl-urlfilter.txt:
>
> [r...@nssdemo nutch-1.0]# cat conf/crawl-urlfilter.txt
> #skip http:, ftp:, & mailto: urls
> ##-^(file|ftp|mailto):
>
> -^(http|ftp|mailto):
>
> #skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
> #skip URLs containing certain characters as probable queries, etc.
> -[...@=]
>
> #accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
> #accecpt anything else
> +.*
>
> And, in fs-urls, I have urls:
>
> file:///testfiles/ <file:///testfiles/>
>
> #file:///data/readings/semanticweb/
>
> For this test, I have a /testfiles directory, with a bunch of .txt files
> under two directories /testfiles/Content1 and /testfiles/Content2.
>
> It looks like the crawl goes to the end, and creates the directories and
> files under acrawlfs1.test, but when I run Luke on the index directory, I got
> an error, with a popup window with just "0" in it.
>
> Is the problem because of that 404 error in the log? If so, why am I getting
> that 404 error?
>
> Thanks,
> Jim
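
P.S. Until I get that parent-directory patch, I'm thinking of keeping the crawl inside the test tree with the regex URL filter alone. This is only a rough sketch of what I'd try in conf/crawl-urlfilter.txt, assuming the seed stays file:///testfiles/; as I understand it, urlfilter-regex applies the rules top down and the first match wins, so a final "-." (in place of my current "+.*" catch-all) should reject anything outside /testfiles/, including the parent directories:

#skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

#skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

#accept only file: URLs under the test tree
+^file:///testfiles/

#reject everything else (replaces the "+.*" catch-all)
-.

And the seed file under fs-urls would then have just the single corrected line, with no "<...>":

file:///testfiles/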
