Dennis Kubes wrote:
I think that I am not fully understanding the role the segments directory and its contents play.

A segment is simply a set of URLs fetched in the same round, together with the data associated with those URLs. The content subdirectory contains the raw HTTP content. The parse-text subdirectory contains the extracted text, used when indexing and when building snippets for hits. The index subdirectory holds a Lucene index of the pages in the segment, and so on. Each segment is an independent chunk of Nutch data.

In 0.8, each segment subdirectory is further split into parts, which are the output of distributed processing. Each URL is assigned to a part by the hash of the URL.
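
To illustrate the idea of splitting by URL hash, here is a rough sketch in Python. It is not Nutch code: the function names part_for_url and split_into_parts are made up for this example, and zlib.crc32 stands in for whatever hash function the real implementation uses. The only point is that the same URL always lands in the same part, whatever order the URLs arrive in.

```python
import zlib


def part_for_url(url: str, num_parts: int) -> int:
    """Return the index of the part a URL would be assigned to.

    Hypothetical helper: crc32 is only a stand-in hash, chosen because it
    is deterministic across runs (Python's built-in hash() of a str is not).
    """
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    return zlib.crc32(url.encode("utf-8")) % num_parts


def split_into_parts(urls, num_parts):
    """Group URLs into num_parts lists according to their hash."""
    parts = [[] for _ in range(num_parts)]
    for url in urls:
        parts[part_for_url(url, num_parts)].append(url)
    return parts


if __name__ == "__main__":
    example_urls = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/",
    ]
    for index, part in enumerate(split_into_parts(example_urls, 4)):
        print(index, part)
```

Because the assignment depends only on the URL and the number of parts, any process can work out which part holds a given URL without consulting the others.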

Does that help?

Doug

