Thank you for the tip, but I still can't solve my problem.
Let me explain in more detail what I'm doing...

1. I created a file called 'urls.txt' and put one URL in it (e.g.
http://localhost/xxx/)
2. nutch admin db -create
3. nutch inject db urls.txt
4. nutch generate db segments
5. nutch fetch segments/<latest_segment>
6. nutch updatedb db segments/<latest_segment>
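
For reference, steps 4-6 can be sketched as a small shell helper. This is
only a rough sketch: the names latest_segment, crawl_round and the NUTCH
variable are made up for illustration, and it assumes that segment
directories have timestamp names, so the last one in sorted order is the
newest. The nutch subcommands are exactly the ones listed above.

```shell
#!/bin/sh
# Sketch only: helper names and the NUTCH variable are illustrative
# assumptions, not part of nutch itself.

# Print the newest segment directory under $1.
# Assumes segment directories are named by timestamp, so that the
# lexicographically last name is also the most recent one.
latest_segment() {
    ls -d "$1"/*/ 2>/dev/null | sort | tail -n 1 | sed 's:/*$::'
}

# Run one generate/fetch/updatedb round (steps 4-6) against the given
# webdb directory ($1) and segments directory ($2).
crawl_round() {
    db="$1"
    segments="$2"
    # Set NUTCH="echo nutch" to print the commands instead of running them.
    nutch_cmd="${NUTCH:-nutch}"

    $nutch_cmd generate "$db" "$segments" || return 1

    seg=$(latest_segment "$segments")
    if [ -z "$seg" ]; then
        echo "no segment found in $segments" >&2
        return 1
    fi

    $nutch_cmd fetch "$seg" || return 1
    $nutch_cmd updatedb "$db" "$seg"
}

# Example: repeat steps 4-6 three times.
# for i in 1 2 3; do crawl_round db segments || break; done
```

Running it with NUTCH="echo nutch" only prints the commands, which is a
cheap way to check which segment each round would pick.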

After repeating steps 4-6, say, 2-3 times and creating the index, I then run:

* nutch inject db new_urls.txt (new_urls.txt contains something like
http://localhost/yyy/)
* nutch generate db segments
* nutch fetch segments/<latest_segment>

The fetcher still downloads URLs from http://localhost/xxx/ (along
with those from http://localhost/yyy/), even though there are no links
between the two sites.

I can understand why it is behaving this way: I think the last
'generate' instruction takes all outgoing links from the latest
segment, doesn't it?
But how can I 'force' nutch to consider only outgoing links from the
newly injected URL?
A regex-urlfilter won't solve my problem, since this is a very simple
example and not a real production scenario...

Thank you in advance,
Ennio

On 1/24/06, "Håvard W. Kongsgård" <[EMAIL PROTECTED]> wrote:
> If your "old urls" have not expired (30 days), then a bin/nutch generate
> will process only the new urls.
>
>
>
> Ennio Tosi wrote:
>
> >Hi, I created an index from an injected url. My problem is that if now
> >I inject another url in the webdb, the fetcher reprocesses the
> >starting url too... Is there a way to tell nutch to only process the
> >latest injected resource?
> >
> >Thanks,
> >Ennio


_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
