Lukas,
   Thanks for your e-mail. I assumed I could drop the $depth oldest
segments because I first merged them all into one segment (which
I don't drop). Is my assumption incorrect, and could it cause
problems in the future? If so, I'll go back to the original version
of my script, in which I kept all the segments without merging. However,
it seemed that in that case the large number of segments being kept
would become a problem after enough recrawls.
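
A minimal sketch of the "drop the $depth oldest segments" step discussed
here, not the actual recrawl script. It assumes segment directories are
named by creation timestamp (e.g. 20060804123456), so that a plain sort
is also chronological, and that the merged segment has already been
written successfully before anything is deleted. The function names
oldest_segments and drop_oldest_segments are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumption: each entry under the segments directory is a
# segment named by its creation timestamp, so sorting the names
# lexicographically also sorts them from oldest to newest.

# oldest_segments DIR N: print the names of the N oldest segments in DIR.
oldest_segments() {
    ls -1 "$1" | sort | head -n "$2"
}

# drop_oldest_segments DIR N: delete the N oldest segments in DIR.
# Meant to run only after the merge step has finished, so the merged
# segment (which is kept) already holds the data being removed.
drop_oldest_segments() {
    for seg in $(oldest_segments "$1" "$2"); do
        # ${1:?} aborts if DIR is empty, so this can never expand to "/name".
        rm -rf "${1:?}/$seg"
    done
}
```

Calling oldest_segments first and checking its output before calling
drop_oldest_segments gives a cheap dry run. Note that this does not
answer whether the dropped segments are fully covered by newer ones; it
only shows the selection and deletion step.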

 Thanks,
  Matt

Lukas Vlcek wrote:
> Hi Matthew,
>
> I am curious about one thing. How do you know you can just drop the
> $depth oldest segments in the end? I haven't studied the nutch
> code regarding this topic yet, but I thought that a segment can be
> dropped once you are sure that all its content is already crawled in
> some newer segments (which should be checked somehow via some
> function/script, which hasn't been implemented yet to my knowledge).
>
> Also I don't think this question has been discussed on dev/user lists
> in detail yet so I just wanted to ask you about your opinion. The
> situation could get even more complicated if people add the -topN
> parameter to the script (which can happen because some might prefer
> crawling in ten smaller batches over two huge crawls for various
> technical reasons).
>
> Anyway, never mind if you don't want to bother with my silly
> question :-)
>
> Regards,
> Lukas
>
> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> Last email regarding this script. I found a bug in it that is sporadic
>> (I think it only affected different setups). However, since it would be
>> a problem sometimes, I refactored the script. I'd suggest you redownload
>> the script if you are using it.
>>
>> Matt
>>
>> Matthew Holt wrote:
>> > I'm currently pretty busy at work. If I have time, I'll do it later.
>> >
>> > The version 0.8 recrawl script has a working version online now. I
>> > temporarily modified it on the website yesterday when I ran into some
>> > problems, but I further tested it and the actual working code is
>> > modified now. So if you got it off the web site any time yesterday, I
>> > would redownload the script.
>> >
>> > Matt
>> >
>> > Lourival Júnior wrote:
>> >> Hi Matthew!
>> >>
>> >> Could you update the script for version 0.7.2 with the same
>> >> functionality? I wrote a script that does this, but it doesn't work
>> >> very well...
>> >>
>> >> Regards!
>> >>
>> >> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> >>>
>> >>> Just letting everyone know that I updated the recrawl script on the
>> >>> Wiki. It now merges the created segments, then deletes the old segs
>> >>> to prevent a lot of unneeded data remaining/growing on the hard drive.
>> >>>   Matt
>> >>>
>> >>>
>> >>> http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >
>>
>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
