[Nutch-general] Re: How to get Text and Parse data for URL

Doug Cutting Tue, 25 Apr 2006 15:27:59 -0700

Dennis Kubes wrote:

I think that I am not fully understanding the role the segments directory and its contents play.

A segment is simply a set of urls fetched in the same round, and data associated with these urls. The content subdirectory contains the raw http content. The parse-text subdirectory contains the extracted text, used when indexing and when building snippets for hits. The index subdirectory holds a Lucene index of the pages in the segment. Etc. It is an independent chunk of Nutch data.
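To make the layout above concrete, here is a minimal Python sketch that reports which of the subdirectories named above exist under a given segment directory. The function name `describe_segment` and the lookup table are hypothetical, introduced only for illustration. The subdirectory names are copied from the description above, and the exact on-disk names may differ between Nutch versions.

```python
import os

# Subdirectory names as given in the description above (assumption: the
# actual on-disk names may differ between Nutch versions).
SEGMENT_PARTS = {
    "content": "raw http content",
    "parse-text": "extracted text, used for indexing and snippets",
    "index": "Lucene index of the pages in the segment",
}


def describe_segment(segment_dir):
    """Return {subdirectory: description} for the known subdirectories
    that actually exist under segment_dir."""
    found = {}
    for name, description in SEGMENT_PARTS.items():
        if os.path.isdir(os.path.join(segment_dir, name)):
            found[name] = description
    return found
```

Calling `describe_segment` on a segment path returns only the entries that are present, so a segment that has been fetched but not yet indexed would simply lack the `index` key.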

In 0.8, each segment subdirectory is further split into parts, the result of distributed processing. The parts are split by the hash of the url.
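Splitting by the hash of the url can be sketched roughly as follows. This is only an illustration of the general idea, not the code Nutch uses: the hash function (`zlib.crc32`), the helper names `part_for_url` and `part_name`, and the zero-padded part naming are all assumptions made for the example.

```python
import zlib


def part_for_url(url, num_parts):
    """Pick a part number in [0, num_parts) from a stable hash of the url."""
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    # crc32 is a stand-in hash chosen for this sketch, not the one Nutch uses.
    return zlib.crc32(url.encode("utf-8")) % num_parts


def part_name(part):
    """Format a part number as a directory name (assumed naming convention)."""
    return "part-%05d" % part
```

Because the part number depends only on the url and the number of parts, the same url always lands in the same part, which is what lets a lookup go straight to one part instead of scanning all of them.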


Does that help?

Doug


