When Nutch runs a generate/fetch/parse cycle, i.e. the steps that produce segment data for indexing, the data is stored in various forms within the segment. Segments serve more purposes than I can explain in this reply, but those details do not add to this particular thread.
If you have a look at nutch-default.xml you will notice a deprecated property, db.default.fetch.interval; ignore it for the time being and focus instead on db.fetch.interval.default (which is the more accurate way of specifying the default interval between re-fetches of any given page anyway). Any segment older than this value can be safely deleted, as new segments will have been created by successive crawl cycles, rendering the old one less useful to us. This is one option for reducing the amount of disk space Nutch data occupies.

An alternative is mergesegs, which can also apply filtering and slicing options to produce a healthier output segment. I remember learning on this list some time ago that mergesegs is a useful command for managing a Nutch instance which produces several segments per day. Understandably this can get out of hand pretty quickly, so merging segment data enables us to manage it effectively.

In general, though it depends strictly on the size and nature of your Nutch crawls, we rarely experience problems with the amount of disk space occupied by segment data in Nutch >= 1.3; however, I'm sure there are extreme cases out there.

On Thu, Jul 28, 2011 at 9:18 AM, Chris Alexander <chris.alexan...@kusiri.com> wrote:

> Cheers Lewis, perhaps I should attempt to rephrase the question.
>
> Clearly Nutch must download and store the contents of a page during a
> crawl. However, once you have indexed this content, does Nutch keep this
> data, or is it cleaned up automatically, or is there a command to do it?
>
> Thanks
>
> Chris
>
> On 27 July 2011 17:14, lewis john mcgibbney <lewis.mcgibb...@gmail.com>
> wrote:
>
> > Hi Alexander,
> >
> > I don't want to state the obvious here, but this will depend directly
> > on what type of loading your Nutch implementation deals with...
> >
> > You are correct in stating that we store data in segments, namely:
> >
> > /crawl_fetch
> > /content
> > /crawl_parse
> > /parse_data
> > /crawl_generate
> > /parse_text
> >
> > I understand that this doesn't add much value to answering your
> > question, but as we are now indexing with Solr (and therefore not
> > storing larger amounts of data with Nutch) I am struggling slightly
> > to understand the issues you are trying to answer.
> >
> > On Mon, Jul 25, 2011 at 5:13 PM, Chris Alexander
> > <chris.alexan...@kusiri.com> wrote:
> >
> > > Hi all,
> > >
> > > I have been asked to look at doing some disk space estimates for
> > > our Nutch usage. It looks like Nutch stores the content of the
> > > pages it downloads and indexes in its data directory for the
> > > segment, is this the case?
> > >
> > > Are there any other major storage requirements I should make note
> > > of with Nutch specifically (not the Solr storage, we can handle
> > > that bit)?
> > >
> > > Cheers
> > >
> > > Chris
>
> --
> *Lewis*

--
*Lewis*
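P.S. For reference, the property discussed above lives in nutch-default.xml and can be overridden in conf/nutch-site.xml. The value below is the shipped 30-day default, shown purely as an illustration:

```xml
<!-- Override in conf/nutch-site.xml; 2592000 seconds = 30 days, the default. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page.</description>
</property>
```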
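P.P.S. The segment-pruning approach described above can be sketched as a small shell script. This is only a sketch, not a supported Nutch tool: it assumes a crawl on the local filesystem with the usual timestamp-named segment directories (e.g. 20110728091800), and the demo_crawl paths and the demo setup are hypothetical.

```shell
#!/bin/sh
# Sketch: prune segments older than db.fetch.interval.default (30 days
# by default). Segment directory names encode their creation time, so a
# plain numeric comparison against a cutoff timestamp identifies the
# segments that predate the interval. Hypothetical paths throughout.

CRAWL_DIR=demo_crawl
FETCH_INTERVAL_DAYS=30

# Demo setup only: one stale segment, one fresh segment.
mkdir -p "$CRAWL_DIR/segments/20000101000000"
mkdir -p "$CRAWL_DIR/segments/$(date +%Y%m%d%H%M%S)"

# Cutoff timestamp in the same YYYYMMDDHHMMSS format (GNU date).
cutoff=$(date -d "$FETCH_INTERVAL_DAYS days ago" +%Y%m%d%H%M%S)

for seg in "$CRAWL_DIR"/segments/*; do
  name=$(basename "$seg")
  if [ "$name" -lt "$cutoff" ]; then
    echo "removing stale segment: $seg"
    rm -r "$seg"
  fi
done

# Alternative to deleting: merge and slim the segments instead
# (Nutch 1.x usage; the output path here is hypothetical):
# bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -filter -slice 50000
```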