Re: [Fwd: Re: 0.8 Recrawl script updated]

Matthew Holt Tue, 08 Aug 2006 12:00:12 -0700

Since it wasn't really clear whether my script approached the problem of deleting segments correctly, I refactored it so it generates the new number of segments, merges them into one, then deletes the "new" segments. Not as efficient disk-space-wise, but it still removes a large number of the segments that are not being referenced by anything due to not being indexed yet.
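
The merge-then-delete order described above can be sketched roughly as follows. This is only an illustration on toy directories, not the wiki script itself: `merge_then_prune`, `fake_merge`, and the segment name `merged` are hypothetical, and the `merge` callable stands in for whatever the real Nutch segment merge step does.

```python
import shutil
import tempfile
from pathlib import Path


def merge_then_prune(segments_dir, new_segments, merge):
    """Merge the freshly generated segments into one, then delete them.

    `merge` is a hypothetical callable (inputs, output_dir) standing in
    for the real Nutch segment merge step.
    """
    merged = Path(segments_dir) / "merged"  # hypothetical merged-segment name
    merge(new_segments, merged)
    # Only remove the inputs once the merged segment actually exists.
    if not merged.is_dir():
        raise RuntimeError("merge produced no output; keeping new segments")
    for seg in new_segments:
        shutil.rmtree(seg)
    return merged


def fake_merge(inputs, output_dir):
    """Placeholder merge: copy every file from each input into output_dir."""
    output_dir.mkdir(parents=True)
    for seg in inputs:
        for f in seg.iterdir():
            shutil.copy(f, output_dir / f"{seg.name}-{f.name}")


def demo():
    """Create one old and three new toy segments, merge, list what is left."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "20060801000000").mkdir()  # older segment, left untouched
        new = []
        for name in ("20060808000001", "20060808000002", "20060808000003"):
            seg = root / name
            seg.mkdir()
            (seg / "data").write_text(name)
            new.append(seg)
        merge_then_prune(root, new, fake_merge)
        return sorted(p.name for p in root.iterdir())
```

Checking that the merged segment exists before removing its inputs is a design choice of this sketch, so a failed merge does not cost any crawled data.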

I reupdated the wiki. Unless there is any more clarification regarding the issue, hopefully I won't have to bombard your inbox with any more emails regarding this.


Matt

Lukas Vlcek wrote:

Hi again,

I just found related discussion here:
http://www.nabble.com/NullPointException-tf2045994r1.html

I think these guys are discussing a similar problem, and if I understood
the conclusion correctly then the only solution right now is to write
some code and test which segments are used in the index and which are not.
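
A minimal sketch of that kind of check, assuming the set of segment names the index refers to has already been collected somehow (how to collect it is outside this sketch). The names `unreferenced_segments` and `demo` are hypothetical, not part of Nutch.

```python
import tempfile
from pathlib import Path


def unreferenced_segments(segments_dir, referenced_names):
    """Return the names of segment directories the index does not refer to.

    `referenced_names` is assumed to be gathered from the index beforehand.
    """
    on_disk = {p.name for p in Path(segments_dir).iterdir() if p.is_dir()}
    return sorted(on_disk - set(referenced_names))


def demo():
    """Three toy segments on disk, two of them referenced by the index."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("20060801", "20060804", "20060808"):
            (Path(tmp) / name).mkdir()
        return unreferenced_segments(tmp, {"20060801", "20060808"})
```

Only the segments returned by such a check would be candidates for deletion; anything the index still refers to stays.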

Regards,
Lukas

On 8/4/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
Matthew,

In fact I didn't realize you were doing the merge (sorry for that),
but frankly I don't know exactly how merging works, whether this
strategy would hold up in the long run, or whether it is a
universal approach across all the cases that may occur during
crawling (-topN, frozen threads, unavailable pages, a crawl dying,
etc.). Maybe it is the correct path. I would appreciate it if anybody
could answer this question precisely.

Thanks,
Lukas

On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> If anyone doesn't mind taking a look...
>
>
>
> ---------- Forwarded message ----------
> From: Matthew Holt <[EMAIL PROTECTED]>
> To: nutch-user@lucene.apache.org
> Date: Fri, 04 Aug 2006 10:07:57 -0400
> Subject: Re: 0.8 Recrawl script updated
> Lukas,
>    Thanks for your e-mail. I assumed I could drop the $depth number of
> oldest segments because I first merged them all into one segment (which
> I don't drop). Am I incorrect in my assumption, and can this cause
> problems in the future? If so, then I'll go back to the original version
> of my script, where I kept all the segments without merging. However, it
> just seemed that if that is the case, it will be a problem after enough
> recrawls due to the large number of segments being kept.
>
>  Thanks,
>   Matt
>
> Lukas Vlcek wrote:
> > Hi Matthew,
> >
> > I am curious about one thing. How do you know you can just drop the $depth
> > oldest segments in the end? I haven't studied the Nutch
> > code regarding this topic yet, but I thought that a segment can be
> > dropped once you are sure that all its content is already crawled in
> > some newer segments (which should be checked somehow via some
> > function/script, which hasn't been implemented yet to my knowledge).
> >
> > Also I don't think this question has been discussed on the dev/user lists
> > in detail yet, so I just wanted to ask your opinion. The
> > situation could get even more complicated if people add the -topN
> > parameter to the script (which can happen because some might prefer
> > crawling in ten smaller batches over two huge crawls for various
> > technical reasons).
> >
> > Anyway, never mind if you don't want to bother with my silly question
> > :-)
> >
> > Regards,
> > Lukas
> >
> > On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >> Last email regarding this script. I found a bug in it that is sporadic
> >> (I think it only affected different setups). However, since it would be
> >> a problem sometimes, I refactored the script. I'd suggest you redownload
> >> the script if you are using it.
> >>
> >> Matt
> >>
> >> Matthew Holt wrote:
> >> > I'm currently pretty busy at work. If I have time, I'll do it later.
> >> >
> >> > The version 0.8 recrawl script has a working version online now. I
> >> > temporarily modified it on the website yesterday when I ran into some
> >> > problems, but I further tested it and the actual working code is
> >> > modified now. So if you got it off the web site any time yesterday, I
> >> > would redownload the script.
> >> >
> >> > Matt
> >> >
> >> > Lourival Júnior wrote:
> >> >> Hi Matthew!
> >> >>
> >> >> Could you update the script to the version 0.7.2 with the same
> >> >> functionalities? I wrote a script that does this, but it doesn't
> >> >> work very well...
> >> >>
> >> >> Regards!
> >> >>
> >> >> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >> >>>
> >> >>> Just letting everyone know that I updated the recrawl script on the
> >> >>> Wiki. It now merges the created segments, then deletes the old segs to
> >> >>> prevent a lot of unneeded data remaining/growing on the hard drive.
> >> >>>   Matt
> >> >>>
> >> >>>
> >> >>>
> >> >>> http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >
>
>
>
>
