Hi Zhanibek,

I would like to point you to the thread Markus started a short time ago
[1], which closely resembles your own questions. I think the main question
to answer now is: how do we extract tf-idf from a crawled website? As we
now refer to Nutch as an independent software project focussed solely on
crawling, answering it would add significant value to our understanding of
its inner workings.

Markus mentioned that there are many aspects we need to consider before
trying to compile a tf-idf score, e.g. link score, norms, boosts, functions
etc. This makes it relatively hard for me (and I suspect others) to comment
accurately on which components we actually need to consider and understand
in this specific context before we can address the fundamental question at
hand...
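
For reference, and leaving all of those extra factors aside, the plain
textbook tf-idf over text that has already been extracted from the crawl
could be sketched roughly as below. This is only a minimal Python sketch and
not what Nutch or Lucene actually compute; the function name tf_idf, the
whitespace tokenisation and the sample pages are all assumptions made for
illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document (textbook tf-idf).

    Hypothetical helper: no link score, norms or boosts are applied.
    """
    # Naive tokenisation: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # tf = relative frequency in this document,
            # idf = log(number of documents / document frequency)
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Made-up page texts standing in for content parsed from a crawl.
pages = [
    "nutch crawls the web",
    "lucene indexes the text",
]
for page_scores in tf_idf(pages):
    print(page_scores)
```

A term that appears in every page (here "the") scores zero, which is the
intended effect of the idf part.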

I think there is a good deal of lateral thinking required here ;0)

In the meantime, have you had any chance to delve into this?


[1] http://www.mail-archive.com/user%40nutch.apache.org/msg03517.html

On Wed, Aug 3, 2011 at 5:28 AM, Zhanibek Datbayev <[email protected]> wrote:

> Hello Nutch Users,
> I've googled for a while and still can not find answers to the following:
> 1. After I crawl a web site, how can I extract tf-idf for it?
> 2. How can I access original web pages crawled?
> 3. Is it possible to get for each word id it corresponds to?
>
> Thanks in advance!
>
> -Zhanibek
>



-- 
*Lewis*
