You could do something like this:

bin/nutch readseg -dump $NUTCH_HOME/crawl/segment/SEGNAME OUTPUT_DIR/ \
    -nocontent -nogenerate -noparse -noparsedata -noparsetext

This will write a file called 'dump' to OUTPUT_DIR/ containing only the
fetcher data (crawl_fetch). Each entry will look something like:

Recno:: 4
URL:: http://www.examplepage.com

CrawlDatum::
Version: 4
Status: 5 (fetch_success)
Fetch time: Tue Nov 07 22:54:09 JST 2006
Modified time: Thu Jan 01 09:00:00 JST 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: 71fc0f7885a5766980c785a72934dcb0
Metadata: null

You could then grab the URLs based on the 'Status' value; see the sketch
below. Dumping only the content produces similar output. If there is a
faster way, please let me know!
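
For example, to pull the URLs out of the dump file (an untested sketch;
it assumes the entry format shown above, where each URL follows a
'URL::' marker and the Status line follows a few lines later):

    # all URLs in the segment
    grep '^URL::' OUTPUT_DIR/dump | awk '{print $2}' > urls.txt

    # only URLs whose fetch succeeded: remember the last URL seen and
    # print it when the matching Status line says fetch_success
    awk '/^URL::/ {url=$2} /^Status:/ && /fetch_success/ {print url}' \
        OUTPUT_DIR/dump > fetched_urls.txt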

Check out bin/nutch readseg for the full set of options.
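
Since you mentioned a group of segments, a shell loop over the segment
directories should also work. A minimal sketch, assuming the standard
crawl/segments layout (adjust the paths to match yours); each segment
gets its own output directory, since readseg refuses to overwrite an
existing one:

    for s in $NUTCH_HOME/crawl/segments/*; do
        bin/nutch readseg -dump $s OUTPUT_DIR/$(basename $s) \
            -nocontent -nogenerate -noparse -noparsedata -noparsetext
    done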

cheers!


On 11/13/06, Bryan Woliner <[EMAIL PROTECTED]> wrote:

Hi,

When I was using nutch 0.7, I found the bin/nutch fetchlist -dumpurls
command to be very useful. However, I have not been able to find an
equivalent command in nutch 0.8.x.

Essentially all I want to do is dump all urls stored in a certain segment
(or group of segments) into a text file.

In nutch 0.7.x I would call a command like this:

$ bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls $s1 > foo.txt

Any suggestions for how this can be accomplished in nutch 0.8.x are very
much appreciated.

Thanks,
Bryan

