Would it (the mail logic below) work for a recrawl?

On Mon, May 21, 2012 at 2:43 PM, Vikas Hazrati <vi...@knoldus.com> wrote:
> Thanks Markus, usually for recrawling, I see that there are options which
> do not use bin/nutch crawl, like the "Can it recrawl" section in
> http://wiki.apache.org/nutch/Crawl
>
> However, would there be a difference? Are there other things to keep in
> mind if we end up setting up a cron job on Linux to, say, crawl every day,
> and each day we trigger something like
>
> bin/nutch crawl urls -dir arndme -depth 4 -topN 3
>
> So we have the cron job which calls this again and again.
>
> Regards | Vikas
>
> On Tue, May 15, 2012 at 6:07 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
>> On Tuesday 15 May 2012 17:39:31 Vikas Hazrati wrote:
>> > So once the crawl (which abstracts iterative crawls till the depth is
>> > reached) is finished, is there a way to trigger a recrawl as part of
>> > some command line option, so that Nutch continues to run as a daemon,
>> > or is a shell script the way out?
>>
>> Shell scripting is the way to go. Nutch will automatically recrawl pages
>> that are due to be refetched.
>>
>> >
>> > Regards | Vikas
>> >
>> > On Fri, May 11, 2012 at 8:26 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
>> > > If you would like I could add you to the moderators group and you can
>> > > word it how you wish.
>> > >
>> > > Please sign up to Jira, give me your Jira username on this page, and I
>> > > will happily add you to the group.
>> > >
>> > > On the other hand, if you don't wish to do this, then please reply
>> > > here with your suggestion and I'll make sure something gets changed to
>> > > accommodate your suggestions.
>> > >
>> > > Thanks
>> > >
>> > > On Fri, May 11, 2012 at 2:52 PM, Matthias Paul <magethle.nu...@gmail.com> wrote:
>> > > > I was confused by this tutorial:
>> > > > http://wiki.apache.org/nutch/NutchTutorial
>> > > >
>> > > > Reading this page one might get to the conclusion that the crawl tool
>> > > > can't do iterative crawling, because under "3.2 Using Individual
>> > > > Commands for Whole-Web Crawling" there's the sentence "This also
>> > > > permits ... incremental crawling", as if the crawl command described
>> > > > before (3.1 Using the Crawl Command) couldn't do that.
>> > > >
>> > > > Could someone perhaps improve this part of the tutorial?
>> > > >
>> > > > Matthias
>> > > >
>> > > > On Thu, May 10, 2012 at 8:39 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>> > > >> By default each crawl is iterative. The crawl command is nothing more
>> > > >> than a wrapper around the individual crawl cycle commands. The depth
>> > > >> parameter is nothing more than executing a single crawl cycle multiple
>> > > >> times. This is, if I am not mistaken, also true for older releases,
>> > > >> certainly 1.2 and above.
>> > > >>
>> > > >> On Thu, 10 May 2012 19:31:27 +0100, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
>> > > >>> For the record, there is a patch pending review for Nutchgora which
>> > > >>> will sort part of this for you as well.
>> > > >>>
>> > > >>> https://issues.apache.org/jira/browse/NUTCH-1301
>> > > >>>
>> > > >>> Susam Pal also contributed a patch for Nutchgora regarding incremental
>> > > >>> indexing but I can't find it just now, sorry.
>> > > >>>
>> > > >>> Lewis
>> > > >>>
>> > > >>> On Thu, May 10, 2012 at 5:18 PM, Matthias Paul <magethle.nu...@gmail.com> wrote:
>> > > >>>> Hi all,
>> > > >>>>
>> > > >>>> can the crawl-command also be used for iterative crawls?
>> > > >>>> In older Nutch-versions this was not possible but in 1.5 it seems to
>> > > >>>> work?
>> > > >>>>
>> > > >>>> Thanks
>> > > >>>> Matthias
>> > > >>
>> > > >> --
>> > > >> Markus Jelsma - CTO - Openindex
>> > >
>> > > --
>> > > Lewis
>> --
>> Markus Jelsma - CTO - Openindex
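
To make Markus's description of the crawl command concrete, here is a minimal sketch of what "-depth N" roughly corresponds to in terms of the individual Nutch 1.x commands named in the tutorial (inject, generate, fetch, parse, updatedb). It only prints the commands and runs nothing. The directory names (urls, crawl/crawldb, crawl/segments), the <newest-segment> placeholder and the plan_crawl helper are assumptions made for illustration, and any steps the wrapper runs after the loop, such as link inversion or indexing, are left out.

```shell
#!/bin/sh
# Dry-run sketch: print the per-cycle commands that a crawl of a given depth
# roughly expands to. Nothing is executed against Nutch.
# Assumptions: the paths below and the <newest-segment> placeholder are
# illustrative; plan_crawl is a hypothetical helper, not part of Nutch.

plan_crawl() {
    depth=$1   # number of generate/fetch/parse/updatedb cycles
    topn=$2    # value passed to generate's -topN

    # Seed the crawldb once from the URL directory.
    echo "bin/nutch inject crawl/crawldb urls"

    # One crawl cycle per unit of depth.
    i=1
    while [ "$i" -le "$depth" ]; do
        echo "bin/nutch generate crawl/crawldb crawl/segments -topN $topn"
        echo "bin/nutch fetch <newest-segment>"
        echo "bin/nutch parse <newest-segment>"
        echo "bin/nutch updatedb crawl/crawldb <newest-segment>"
        i=$((i + 1))
    done
}

# Same depth and topN as the command quoted in the thread.
plan_crawl 4 3

# Hypothetical crontab entry for the daily run discussed above
# (the /opt/nutch path and the 02:00 schedule are assumptions):
# 0 2 * * * cd /opt/nutch && bin/nutch crawl urls -dir arndme -depth 4 -topN 3
```

Because every cycle's generate step selects from the crawldb, running the same cycle again later (for example from cron) is what picks up pages that are due to be refetched, which matches the "shell scripting is the way to go" advice above.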