Lukas,

Thanks for your e-mail. I assumed I could drop the $depth oldest segments because I first merge them all into one segment, which I don't drop. Is that assumption incorrect, and could it cause problems in the future? If so, I'll go back to the original version of my script, which kept all the segments without merging. It just seemed that, in that case, the large number of segments being kept would become a problem after enough recrawls.

Thanks,
Matt

Lukas Vlcek wrote:
> Hi Matthew,
>
> I am curious about one thing. How do you know you can just drop $depth
> number of the oldest segments in the end? I haven't studied the Nutch
> code regarding this topic yet, but I thought that a segment can be
> dropped once you are sure that all its content is already crawled in
> some newer segments (which should be checked somehow via some
> function/script, which hasn't been implemented yet to my knowledge).
>
> Also, I don't think this question has been discussed on the dev/user
> lists in detail yet, so I just wanted to ask you about your opinion. The
> situation could get even more complicated if people add the -topN
> parameter to the script (which can happen because some might prefer
> crawling in ten smaller batches over two huge crawls for various
> technical reasons).
>
> Anyway, never mind if you don't want to bother with my silly question
> :-)
>
> Regards,
> Lukas
>
> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> Last email regarding this script. I found a bug in it that is sporadic
>> (I think it only affected different setups). However, since it would be
>> a problem sometimes, I refactored the script. I'd suggest you redownload
>> the script if you are using it.
>>
>> Matt
>>
>> Matthew Holt wrote:
>> > I'm currently pretty busy at work. If I have time, I'll do it later.
>> >
>> > The version 0.8 recrawl script has a working version online now. I
>> > temporarily modified it on the website yesterday when I ran into some
>> > problems, but I further tested it and the actual working code is
>> > modified now. So if you got it off the web site any time yesterday, I
>> > would redownload the script.
>> >
>> > Matt
>> >
>> > Lourival Júnior wrote:
>> >> Hi Matthew!
>> >>
>> >> Could you update the script to the version 0.7.2 with the same
>> >> functionalities? I wrote a script that does this, but it doesn't work
>> >> very well...
>> >>
>> >> Regards!
>> >>
>> >> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> >>>
>> >>> Just letting everyone know that I updated the recrawl script on the
>> >>> Wiki. It now merges the created segments, then deletes the old segs
>> >>> to prevent a lot of unneeded data remaining/growing on the hard drive.
>> >>> Matt
>> >>>
>> >>> http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
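
For reference, the "drop the $depth oldest segments after merging" step discussed in this thread can be sketched roughly as follows. This is only an illustration, not the code from the Wiki script: the function name segments_to_drop and the sample segment names are hypothetical, and it assumes segment directories are named by timestamp so that sorting the names also sorts them by age. It only selects names and deletes nothing.

```python
def segments_to_drop(segment_names, merged_name, depth):
    """Pick the `depth` oldest segments, never the merged one.

    Assumption: segment names are timestamps (e.g. 20060804090000), so
    sorting them as strings also sorts them from oldest to newest.
    """
    # Exclude the merged segment so it can never be selected for removal.
    candidates = sorted(n for n in segment_names if n != merged_name)
    return candidates[:depth]


if __name__ == "__main__":
    # Hypothetical listing: two crawl rounds plus the merged segment.
    names = ["20060804093000", "20060804090000", "20060804100000"]
    print(segments_to_drop(names, merged_name="20060804100000", depth=2))
```

Note that this counts segments rather than checking what went into the merge. If a run produces a different number of segments than $depth (for example when -topN is added), the count and the merge inputs can disagree, which is the concern Lukas raises above.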
