Hi Ben,
Here is a method I use to do what you are after.
Each changed URL must be on its own line in the txt file.
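# Remove updated URLs
# Read the URL list on stdin, one URL per line
exec 0</data/urls-update.txt
while IFS= read -r url
do
    echo "$url"
    sleep 5
    # Delete this page from the db (replace <crawl_dir> with your crawl directory)
    "$nutch_dir"/bin/nutch org.apache.nutch.db.WebDBWriter \
        /data/crawls/<crawl_dir>/db -deletepage "$url"
done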
As you can see, it pulls a list of updated URLs from a file and removes
them one by one from the db. They can then be re-inserted using the
following:
# Insert updated URLs
"$nutch_dir"/bin/nutch inject /data/crawls/<crawl_dir>/db -urlfile \
    /data/urls-update.txt
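Since both steps rely on the file holding exactly one URL per line, it may
be worth tidying the list first if it is edited by hand. The following is
only a rough sketch: clean_url_list is a name made up for illustration (it
is not a Nutch command), and it assumes that stray whitespace, blank lines
and repeated URLs are the only problems in the file.

```shell
# Hypothetical helper, not part of Nutch: print a tidied copy of a URL list.
# $1 = path to the URL file (one URL per line is expected).
clean_url_list() {
    # Strip leading and trailing whitespace (this also removes a trailing
    # carriage return left by Windows editors), then drop blank lines and
    # repeated URLs, keeping the first occurrence in its original position.
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$1" |
        awk 'NF && !seen[$0]++'
}
```

It could be used along the lines of
clean_url_list /data/urls-update.txt > /data/urls-update.clean.txt
(the output file name is an assumption), with the delete loop and the
inject then pointed at the cleaned file.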
Hope this helps.
Gary
-----Original Message-----
From: Benjamin Higgins [mailto:[EMAIL PROTECTED]
Sent: 20 October 2006 19:41
To: [email protected]
Subject: Re-injecting URLS, perhaps by removing them from the CrawlDB first?
Hello,
I'd like to remove specific URLs from the CrawlDB. I want to do this so
I can inject them again, but have them marked as not yet crawled.
I want to do this since I have a set of URLs that I know have been
updated, and want them to be refetched right away.
If anyone knows how to do this, or of a better way to go about this,
PLEASE let me know. I am having some difficulty determining the best
way to modify the code to accomplish this task.
Thank you.
Ben
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general