When Nutch runs a generate/fetch/parse cycle, i.e. the steps that produce segment data for indexing, the data is stored in various forms within the segment. Segments serve more purposes than I can explain in this reply, but those details do not add to this particular thread.
If you have a look at nutch-default.xml you will notice a deprecated property, db.default.fetch.interval; ignore it for the time being and focus instead on db.fetch.interval.default (which is the more accurate way of specifying the default interval between re-fetches of any given page anyway). Any segment older than this value can be safely deleted, as new segments will have been created by successive crawl cycles, rendering the old one less useful to us. This is one option for reducing the amount of disk space Nutch data occupies.

An alternative is mergesegs, which can also apply filtering and slicing options to produce a healthier output segment. I remember learning on this list some time ago that mergesegs is a useful command for managing a Nutch instance which produces several segments per day. Understandably this can get out of hand pretty quickly, so merging segment data enables us to manage it effectively.

In general, though it depends strictly on the size and nature of your Nutch crawls, we rarely experience problems with the amount of disk space occupied by segment data in Nutch >= 1.3; however, I'm sure there are extreme cases out there.

On Thu, Jul 28, 2011 at 9:18 AM, Chris Alexander <chris.alexan...@kusiri.com> wrote:

> Cheers Lewis, perhaps I should attempt to rephrase the question.
>
> Clearly Nutch must download and store the contents of a page during a
> crawl. However, once you have indexed this content, does Nutch keep this
> data, or is it cleaned up automatically, or is there a command to do it?
>
> Thanks
>
> Chris
>
> On 27 July 2011 17:14, lewis john mcgibbney <lewis.mcgibb...@gmail.com>
> wrote:
>
> > Hi Alexander,
> >
> > I don't want to state the obvious here, but this will depend directly
> > on what type of loading your Nutch implementation deals with...
> >
> > You are correct in stating that we store data in segments, namely:
> >
> > /crawl_fetch
> > /content
> > /crawl_parse
> > /parse_data
> > /crawl_generate
> > /parse_text
> >
> > I understand that this doesn't add much value to answering your
> > question, but as we are now indexing with Solr (and therefore not
> > storing larger amounts of data with Nutch) I am struggling slightly
> > to understand the issues you are trying to answer.
> >
> > On Mon, Jul 25, 2011 at 5:13 PM, Chris Alexander
> > <chris.alexan...@kusiri.com> wrote:
> >
> > > Hi all,
> > >
> > > I have been asked to look at doing some disk space estimates for
> > > our Nutch usage. It looks like Nutch stores the content of the
> > > pages it downloads and indexes in its data directory for the
> > > segment, is this the case?
> > >
> > > Are there any other major storage requirements I should make note
> > > of with Nutch specifically (not the Solr storage, we can handle
> > > that bit)?
> > >
> > > Cheers
> > >
> > > Chris
>
> --
> *Lewis*

--
*Lewis*
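P.S. For reference, the property discussed above lives in nutch-default.xml and can be overridden in conf/nutch-site.xml. The value below is the shipped 30-day default, shown purely as an illustration:

```xml
<!-- Override in conf/nutch-site.xml; 2592000 seconds = 30 days, the default. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page.</description>
</property>
```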
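P.P.S. The segment-pruning approach described above can be sketched as a small shell script. This is only a sketch, not a supported Nutch tool: it assumes a crawl on the local filesystem with the usual timestamp-named segment directories (e.g. 20110728091800), and the demo_crawl paths and the demo setup are hypothetical.

```shell
#!/bin/sh
# Sketch: prune segments older than db.fetch.interval.default (30 days
# by default). Segment directory names encode their creation time, so a
# plain numeric comparison against a cutoff timestamp identifies the
# segments that predate the interval. Hypothetical paths throughout.

CRAWL_DIR=demo_crawl
FETCH_INTERVAL_DAYS=30

# Demo setup only: one stale segment, one fresh segment.
mkdir -p "$CRAWL_DIR/segments/20000101000000"
mkdir -p "$CRAWL_DIR/segments/$(date +%Y%m%d%H%M%S)"

# Cutoff timestamp in the same YYYYMMDDHHMMSS format (GNU date).
cutoff=$(date -d "$FETCH_INTERVAL_DAYS days ago" +%Y%m%d%H%M%S)

for seg in "$CRAWL_DIR"/segments/*; do
  name=$(basename "$seg")
  if [ "$name" -lt "$cutoff" ]; then
    echo "removing stale segment: $seg"
    rm -r "$seg"
  fi
done

# Alternative to deleting: merge and slim the segments instead
# (Nutch 1.x usage; the output path here is hypothetical):
# bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -filter -slice 50000
```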