[Nutch-general] Partial crawls.

Lyndon Maydwell Sun, 01 Jul 2007 18:01:58 -0700

Hi,

I'm new to Nutch and am wondering about seeding the database by
running a crawl with a very shallow depth, then growing the database
each time a periodic update script runs. I currently use two scripts,
but I'm not sure whether the update script is actually adding
searchable data. The initial crawl script works well, and I can verify
that with the search app that comes with Nutch, but my maintenance
script doesn't seem to add any results, even though it reports no
errors.


Below are the two small scripts. Am I overlooking a simple mistake?

-- initial crawl script << END1 --

#!/bin/sh
./../bin/nutch crawl urls -dir crawl -depth 2 -topN 10000

END1

-- updater script << END2 --

first="crawl"
second="100000"

../bin/nutch generate $first/crawldb $first/segments -topN $second

segment=`ls -d $first/segments/* | tail -1 | grep "[a-zA-Z0-9/]*"`

../bin/nutch fetch       $segment

../bin/nutch updatedb    $first/crawldb $segment

rm -r $first/indexes

../bin/nutch invertlinks $first/linkdb  $first/segments/*

../bin/nutch index       $first/indexes $first/crawldb \
    $first/linkdb $first/segments/*

END2
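
One round of the generate/fetch/updatedb cycle above could be wrapped roughly as follows, so that a failing step stops the round instead of passing silently. This is only a sketch: NUTCH, CRAWL_DIR, latest_segment and update_round are hypothetical names, not Nutch settings, and it assumes segment directories are named by timestamp so that the newest one sorts last.

```shell
#!/bin/sh
# Sketch of one generate/fetch/updatedb round; paths follow the scripts above.
# NUTCH and CRAWL_DIR are illustrative variable names, not Nutch settings.
NUTCH="${NUTCH:-../bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

# Print the newest segment directory under the given crawl directory.
# Assumes timestamp-style segment names, so a lexical sort puts the
# most recent one last.
latest_segment() {
    ls -d "$1"/segments/* 2>/dev/null | sort | tail -n 1
}

# Run one round with the given -topN value; stop at the first failing step.
update_round() {
    topn="$1"
    "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN "$topn" || return 1
    segment=$(latest_segment "$CRAWL_DIR")
    [ -n "$segment" ] || { echo "no segment found" >&2; return 1; }
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment" || return 1
    # Report which segment was processed.
    echo "$segment"
}
```

For example, `update_round 100000` would run one round and print the segment it fetched; the invertlinks and index steps would still need to run afterwards.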
