You could do something like this:
bin/nutch readseg -dump $NUTCH_HOME/crawl/segments/SEGNAME OUTPUT_DIR/ \
  -nocontent -nogenerate -noparse -noparsedata -noparsetext
This will write a file called 'dump' to OUTPUT_DIR/ containing only the
fetcher data. Each entry will look something like:
Recno:: 4
URL:: http://www.examplepage.com
CrawlDatum::
Version: 4
Status: 5 (fetch_success)
Fetch time: Tue Nov 07 22:54:09 JST 2006
Modified time: Thu Jan 01 09:00:00 JST 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: 71fc0f7885a5766980c785a72934dcb0
Metadata: null
You could then grab the URLs based on the 'Status' value, e.g. with the
sketch below. Dumping only the content produces similar output. If there
is a faster way, please let me know!
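For example, an untested awk sketch over the dump file, assuming the
'URL::'/'Status:' layout shown above (the urls.txt name is just for
illustration):

# remember the URL from each record, print it when the Status line
# that follows reports fetch_success (untested; urls.txt is an example name)
awk '/^URL::/ { url = $2 }
     /^Status:.*fetch_success/ { print url }' OUTPUT_DIR/dump > urls.txt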
Check out bin/nutch readseg for the full list of dump options.
cheers!
On 11/13/06, Bryan Woliner <[EMAIL PROTECTED]> wrote:
Hi,
When I was using Nutch 0.7, I found the bin/nutch fetchlist -dumpurls
command to be very useful. However, I have not been able to find an
equivalent command in Nutch 0.8.x.
Essentially, all I want to do is dump all URLs stored in a certain segment
(or group of segments) into a text file.
In Nutch 0.7.x I would call a command like this:
$ bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls $s1 > foo.txt
Any suggestions for how this can be accomplished in Nutch 0.8.x are very
much appreciated.
Thanks,
Bryan