Thank you, I will give it a try :)

On 4/25/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:

On 4/25/07, Charlie Williams <[EMAIL PROTECTED]> wrote:
> I have an index of pages from the web, a bit over 1 million. The fetch
> took several weeks to complete, since it was mainly over a small set of
> domains. Once we had a completed fetch and index, we began trying to
> work with the retrieved text, and found that the cached text is just
> that, flat text. Is the original HTML cached anywhere that it can be
> accessed after the initial fetch? It would be a shame to have to recrawl
> all those pages. We are using Nutch 0.8.

If you have fetcher.store.content set to true then Nutch has stored a
copy of all the pages in <segment_dir>/content. You can extract
individual contents with the command "./nutch readseg -get
<segment_dir> <url> -noparse -nofetch -nogenerate -noparsetext
-noparsedata".
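
If you need to do this for many URLs, a small wrapper script may help.
The following is only a rough sketch, not something shipped with Nutch:
the function names are made up for illustration, it assumes the nutch
script is reachable as "./nutch" as in the command above, and the
segment path and URL in the usage note are placeholders. It passes
exactly the flags quoted above and nothing else.

```python
import subprocess
from typing import List

# Flags copied from the readseg command quoted above; they suppress every
# part of the segment record except the stored content.
SKIP_FLAGS = ["-noparse", "-nofetch", "-nogenerate", "-noparsetext", "-noparsedata"]


def readseg_get_command(segment_dir: str, url: str, nutch_bin: str = "./nutch") -> List[str]:
    """Build the argument list for 'nutch readseg -get' (hypothetical helper)."""
    return [nutch_bin, "readseg", "-get", segment_dir, url] + SKIP_FLAGS


def dump_stored_content(segment_dir: str, url: str, nutch_bin: str = "./nutch") -> str:
    """Run the command and return whatever readseg prints to stdout.

    The output is returned untouched; any header lines readseg prints
    around the page body would still need to be trimmed by the caller.
    """
    result = subprocess.run(
        readseg_get_command(segment_dir, url, nutch_bin),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if nutch exits non-zero
    )
    return result.stdout
```

For example, dump_stored_content("<segment_dir>", "<url>") would run the
same command as above for one page. Building the argument list
separately from running it keeps the flags easy to check without a
Nutch install at hand.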

>
> Thanks for any help.
>
> -Charlie
>


--
Doğacan Güney
