Thank you, I will give it a try :)

On 4/25/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
On 4/25/07, Charlie Williams <[EMAIL PROTECTED]> wrote:
> I have an index of pages from the web, a bit over 1 million. The fetch took
> several weeks to complete, since it was mainly over a small set of domains.
> Once we had a completed fetch and index, we began trying to work with the
> retrieved text, and found that the cached text is just that, flat text. Is
> the original HTML cached anywhere that it can be accessed after the initial
> fetch? It would be a shame to have to recrawl all those pages. We are using
> Nutch 0.8.

If you have fetcher.store.content set to true, then Nutch has stored a
copy of all the pages in <segment_dir>/content. You can extract
individual contents with the command
"./nutch readseg -get <segment_dir> <url> -noparse -nofetch -nogenerate -noparsetext -noparsedata".

> Thanks for any help.
>
> -Charlie

--
Doğacan Güney
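
With about a million pages, running the readseg command quoted above by hand for each URL is not practical, so here is a minimal Python sketch of wrapping it. The helper names (build_readseg_command, dump_record) and the default "./nutch" path are my own assumptions, not part of Nutch, and I am assuming readseg prints the record to standard output; please check that against your install before relying on it.

```python
import subprocess

# Flags copied verbatim from the command quoted above. They switch off
# every segment part except the stored raw content.
SKIP_FLAGS = ["-noparse", "-nofetch", "-nogenerate",
              "-noparsetext", "-noparsedata"]


def build_readseg_command(segment_dir, url, nutch_bin="./nutch"):
    """Return the argument list for one 'readseg -get' call.

    nutch_bin is a hypothetical default; point it at your own script.
    """
    return [nutch_bin, "readseg", "-get", segment_dir, url] + SKIP_FLAGS


def dump_record(segment_dir, url, nutch_bin="./nutch"):
    """Run readseg for one URL and return its raw output as bytes.

    Assumes the record is written to standard output. Bytes are returned
    undecoded because the page encoding is not known in advance.
    """
    result = subprocess.run(
        build_readseg_command(segment_dir, url, nutch_bin),
        capture_output=True,
        check=True,  # raise CalledProcessError if the command fails
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain "&" or "?". Each call starts a new process, so for a full export it may be worth looking at whether readseg can dump a whole segment in one go instead.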