Re: [Nutch-general] Forcing refetch and index of specified files

Tomi NA Fri, 22 Sep 2006 05:22:07 -0700

On 9/21/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Benjamin Higgins wrote:
> > How can I instruct Nutch to refetch specific files and then update the
> > index entries for those files?
> >
> > I am indexing files on a fileserver and I am able to produce a report of
> > changed files about every 30 minutes.
> >
> > I'd like to feed that into Nutch at approximately the same interval so
> > I can keep the index up-to-date.
> >
> > Thanks.
>
> Conceptually this should be easy - you just need to generate a fetchlist
> directly from your list of changed files, and not through
> injecting/generating from a crawldb.
>
> I wrote a tool for 0.7 which does this - look at the NUTCH-68 issue in
> JIRA. This would have to be ported to 0.8 - check how Injector does this
> in the first stage, when it converts a simple text file to a MapFile.


Would an algorithm like this make any sense:
for each URL in txt file
  if URL in crawldb
    update the date to "now()+1" in its crawl datum
  else
    use existing inject logic to inject the new url

After that, it's only a matter of running the recrawl script with -adddays 0.
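
To make the idea concrete, here is a minimal sketch of that loop in Python. It does not use the Nutch API: the crawldb is modelled as a plain dict mapping each URL to its next fetch time, mark_for_refetch is a hypothetical name, and the "inject" branch is only a stand-in for the real Injector. I'm also assuming "+1" means one second, since the unit isn't stated above.

```python
from datetime import datetime, timedelta


def mark_for_refetch(crawldb, changed_urls, now):
    """Mark every changed URL as due for fetching.

    crawldb      -- dict of URL -> next fetch time (a stand-in for the
                    real crawldb and its crawl datum; not Nutch's API)
    changed_urls -- iterable of lines from the changed-files report
    now          -- the current time as a datetime
    Returns the list of URLs that were not known and had to be added.
    """
    injected = []
    for line in changed_urls:
        url = line.strip()
        if not url:
            continue  # skip blank lines in the report
        if url in crawldb:
            # Known URL: move its fetch time to "now()+1"
            # (assumed here to mean one second from now).
            crawldb[url] = now + timedelta(seconds=1)
        else:
            # Unknown URL: stand-in for the existing inject logic.
            crawldb[url] = now
            injected.append(url)
    return injected
```

With the report read from a text file, this would be called as mark_for_refetch(crawldb, open("changed.txt"), datetime.now()), where changed.txt is a hypothetical file name. Whether rewriting entries in place is practical against a real crawldb is exactly the open question.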

t.n.a.
