Re: Crawl-tool for iterative crawling?

Vikas Hazrati Mon, 21 May 2012 11:03:20 -0700

Would it (the mail logic below) work for a recrawl?

On Mon, May 21, 2012 at 2:43 PM, Vikas Hazrati <vi...@knoldus.com> wrote:


> Thanks Markus. For recrawling, I usually see approaches that do not use
> bin/nutch crawl, such as the "Can it recrawl" section in
> http://wiki.apache.org/nutch/Crawl
>
> However, would there be a difference, or other things to keep in mind, if
> we end up setting up a cron job on Linux to crawl every day, and each
> day we trigger something like
>
>  bin/nutch crawl urls -dir arndme -depth 4 -topN 3
>
> So we have the cron which calls this again and again.
>
> Regards | Vikas
>
>
>
> On Tue, May 15, 2012 at 6:07 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
>> On Tuesday 15 May 2012 17:39:31 Vikas Hazrati wrote:
>> > So once the crawl (which abstracts iterative crawls till the depth is
>> > reached) is finished, is there a way to trigger a recrawl as well, as
>> > part of some command line option so that Nutch continues to run as a
>> > daemon, or is a shell script the way out?
>>
>> Shell scripting is the way to go. Nutch will automatically recrawl pages
>> that are due to be refetched.
>>
>> >
>> > Regards | Vikas
>> >
>> > On Fri, May 11, 2012 at 8:26 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
>> > > If you would like I could add you to the moderators group and you can
>> > > word it how you wish.
>> > >
>> > > Please sign up to Jira, give me your Jira username on this page, and I
>> > > will happily add you to the group.
>> > >
>> > > On the other hand, if you don't wish to do this, then please reply
>> > > here with your suggestion and I'll make sure something gets changed to
>> > > accommodate your suggestions.
>> > >
>> > > Thanks
>> > >
>> > > On Fri, May 11, 2012 at 2:52 PM, Matthias Paul <magethle.nu...@gmail.com> wrote:
>> > > > I was confused by this tutorial:
>> > > > http://wiki.apache.org/nutch/NutchTutorial
>> > > >
>> > > > Reading this page, one might conclude that the crawl tool
>> > > > can't do iterative crawling, because under "3.2 Using Individual
>> > > > Commands for Whole-Web Crawling" there's the sentence "This also
>> > > > permits ... incremental crawling", as if the crawl command described
>> > > > before (3.1 Using the Crawl Command) couldn't do that.
>> > > >
>> > > > Could someone perhaps improve this part of the tutorial?
>> > > >
>> > > > Matthias
>> > > >
>> > > > On Thu, May 10, 2012 at 8:39 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>> > > >> By default each crawl is iterative. The crawl command is nothing more
>> > > >> than a wrapper around the individual crawl cycle commands. The depth
>> > > >> parameter is nothing more than executing a single crawl cycle multiple
>> > > >> times. This is, if I am not mistaken, also true for older releases,
>> > > >> certainly 1.2 and above.
>> > > >>
>> > > >> On Thu, 10 May 2012 19:31:27 +0100, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
>> > > >>> For the record, there is a patch pending review for Nutchgora which
>> > > >>> will sort part of this for you as well.
>> > > >>>
>> > > >>> https://issues.apache.org/jira/browse/NUTCH-1301
>> > > >>>
>> > > >>> Susam Pal also contributed a patch for Nutchgora regarding incremental
>> > > >>> indexing, but I can't find it just now, sorry.
>> > > >>>
>> > > >>> Lewis
>> > > >>>
>> > > >>>
>> > > >>> On Thu, May 10, 2012 at 5:18 PM, Matthias Paul <magethle.nu...@gmail.com> wrote:
>> > > >>>> Hi all,
>> > > >>>>
>> > > >>>> can the crawl-command also be used for iterative crawls?
>> > > >>>> In older Nutch-versions this was not possible but in 1.5 it seems to
>> > > >>>> work?
>> > > >>>>
>> > > >>>> Thanks
>> > > >>>> Matthias
>> > > >>
>> > > >> --
>> > > >> Markus Jelsma - CTO - Openindex
>> > >
>> > > --
>> > > Lewis
>> --
>> Markus Jelsma - CTO - Openindex
>>
>>
>
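Editorial sketch, not part of the original messages: Markus's description of
the crawl command as a wrapper around the individual crawl cycle commands,
together with his advice to script recrawls in the shell, could look roughly
like the script below. The helper name crawl_cycles and the NUTCH and
CRAWL_DIR variables are assumptions made for illustration, as is the choice to
leave out link inversion and indexing. The inject, generate, fetch, parse and
updatedb subcommands are the Nutch 1.x ones described in the NutchTutorial
page linked in the thread.

```shell
#!/bin/sh
# Sketch only: repeat the Nutch 1.x crawl cycle "depth" times.
# NUTCH, CRAWL_DIR and crawl_cycles are illustrative names, not Nutch features.

NUTCH="${NUTCH:-bin/nutch}"        # path to the nutch launcher
CRAWL_DIR="${CRAWL_DIR:-arndme}"   # same directory as -dir in the thread

# Usage: crawl_cycles <seed_dir> <depth> <topN>
crawl_cycles() {
    seed_dir=$1
    depth=$2
    top_n=$3

    # Add the seed URLs to the crawldb before cycling.
    $NUTCH inject "$CRAWL_DIR/crawldb" "$seed_dir" || return 1

    i=1
    while [ "$i" -le "$depth" ]; do
        # generate picks URLs that are due for fetching or refetching;
        # stop cycling if it exits non-zero.
        $NUTCH generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" \
            -topN "$top_n" || break

        # Segments are named by timestamp, so the newest one sorts last.
        segment=$(ls -d "$CRAWL_DIR"/segments/* 2>/dev/null | tail -n 1)
        [ -n "$segment" ] || break

        $NUTCH fetch "$segment" || return 1
        $NUTCH parse "$segment" || return 1
        # Feed fetch results back so the next cycle sees newly found URLs.
        $NUTCH updatedb "$CRAWL_DIR/crawldb" "$segment" || return 1

        i=$((i + 1))
    done
}
```

Calling crawl_cycles urls 4 3 at the end of such a script would roughly mirror
bin/nutch crawl urls -dir arndme -depth 4 -topN 3 from the thread, and a daily
cron entry could then invoke the script. Since generate only selects URLs that
are due, repeated runs against the same crawl directory recrawl pages as they
come up for refetching.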
