Hi Andrzej,

I am sorry for the late reply. I haven't had a chance to prepare the
dump files for you yet, but I made one interesting observation which
could shed some light on the problem.
I turned on verbose logging for http and the fetcher, and it seems that
all three of these URLs redirect the fetcher to the same page.

I have a lot of unfetched URLs in the database, but many of them do
not point to any real document (the original document is gone), so
the server redirects to a default page (the home page, a "can't find
this page" page, etc.). Do you think this information could help us now?
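
To make the observation concrete, here is a minimal Python sketch of how
several dead links can collapse onto one redirect target. It is only an
illustration, not how Nutch itself handles redirects: the function names,
the example URLs and the redirect table are all hypothetical, and the
limit of 3 simply mirrors the http.redirect.max value quoted below.

```python
from collections import defaultdict


def resolve(url, redirects, max_redirects=3):
    """Follow a redirect table (source URL -> target URL) starting at url.

    Returns the final URL, or None if the chain needs more than
    max_redirects hops (for example, a redirect loop).
    """
    hops = 0
    while url in redirects:
        if hops >= max_redirects:
            return None  # gave up: too many redirects
        url = redirects[url]
        hops += 1
    return url


def group_by_target(urls, redirects, max_redirects=3):
    """Group URLs by the page they finally resolve to."""
    groups = defaultdict(list)
    for u in urls:
        groups[resolve(u, redirects, max_redirects)].append(u)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical data: three vanished documents, one default page.
    home = "http://example.com/"
    table = {
        "http://example.com/old1.html": home,
        "http://example.com/old2.html": home,
        "http://example.com/old3.html": home,
    }
    for target, sources in group_by_target(list(table), table).items():
        print(target, "<-", sources)
```

If every unfetched URL in such a group ends up on the same default page,
that would match what I see in the verbose log.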

Anyway, I'll try to prepare those dump files for you (I don't have
much experience with the segread command so far). However, I tried the
newest SVN nutch-0.8 today, with the same result.

My current settings for redirects:
<name>http.redirect.max</name>
<value>3</value>

Regards,
Lukas

On 5/17/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Lukas Vlcek wrote:
> Hi Andrzej,
>
> nutch-site.xml says:
> <name>db.default.fetch.interval</name>
> <value>15</value>
>
> I tried readdb -dump.
> I am not an expert in dump output but to me it seems that db is not
> updated.
> I have two dump outputs (pre and post), and diffing them I found the
> following differences:
> 1) Some score values were changed.
> 2) Only one fetch time for one document has been changed, but it is
> not any of those three fetched pages...
>
> I also checked these three pages and they are still unfetched.
>
> Wow that seems very strange...
> Any idea?

Ok, this could indicate some bugs in either Generate or CrawlDbReducer
(both of which have recently been changed in a couple of places). Could you
please do the following:

* prepare a fragment of the crawldb dump with the data about these three
pages.

* generate, so that you get these three pages in the fetchlist (easy to
check with segread).

* fetch

* prepare a fragment of the segment dump (segread -dump) with the data
about these pages

* run updatedb

* prepare a fragment of the crawldb dump after updating

And then package this data nicely and send it to me. Thanks!
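
The "prepare a fragment" steps above could be done by hand, or with a
small script along the following lines. This is a rough Python sketch
under a loud assumption: it treats a text dump as blank-line-separated
records whose first line starts with the URL. That layout, the function
name and the sample text are assumptions for illustration, not verified
Nutch output, so adjust the parsing to whatever the real dump looks like.

```python
def extract_records(dump_text, urls):
    """Return {url: record_text} for each requested URL found in the dump.

    ASSUMPTION: records are separated by blank lines and the first
    whitespace-delimited token of each record is its URL.
    """
    wanted = set(urls)
    found = {}
    for block in dump_text.split("\n\n"):
        block = block.strip("\n")
        if not block:
            continue
        first_line = block.split("\n", 1)[0]
        tokens = first_line.split()
        if tokens and tokens[0] in wanted:
            found[tokens[0]] = block
    return found


if __name__ == "__main__":
    # Made-up sample text in the assumed layout, not real dump output.
    sample = (
        "http://a.example/\tStatus: unfetched\nFetch time: 1\n\n"
        "http://b.example/\tStatus: fetched\nFetch time: 2\n"
    )
    for url, record in extract_records(sample, ["http://b.example/"]).items():
        print(record)
```

Running the same extraction on the dumps taken before and after updatedb
gives two small fragments that are easy to compare side by side.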

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


