[Nutch-general] Extracting the content of some crawled web page

Bruno Patini Furtado Thu, 24 Nov 2005 05:01:04 -0800

Hi,
I need to extract the content of a web page crawled by Nutch. This is the
same functionality that sits behind the "cached" link on the result pages of
the webapp that comes with this great project.


Because I need a more complex query language than the one Nutch provides,
I'm querying the Lucene index directly, using the Lucene query language.

As a side effect, my search results come back as a Lucene Hits instance,
which cannot be passed as a parameter to the getDetails operation:

package org.apache.nutch.searcher;
class NutchBean {
    ...
    public HitDetails getDetails(org.apache.nutch.searcher.Hit hit)...
}

That operation would return a HitDetails instance, from which I could get a
URL's content using the getValues operation:

HitDetails details = nutch.getDetails(nutchHit);
String webPageContent = details.getValues("content")[0];

So my questions are:

   - How can I get the content of a crawled URL by accessing the Lucene
     index directly? or
   - How can I get a Nutch Hit object from a Lucene Hits object? or
   - Is there any other way to retrieve the content of a crawled URL?
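
To make the last two options concrete, here is a rough sketch of the kind of
adapter I imagine: read a stored URL from each hit, then look the page content
up by that URL. It is only a sketch under assumptions. The Map standing in for
a hit's stored fields, the "url" field name, and the ContentStore interface
are all hypothetical stand-ins, not Lucene or Nutch classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HitContentLookup {

    /** Hypothetical content source keyed by URL (not a Nutch API). */
    public interface ContentStore {
        String contentFor(String url);
    }

    /** Builds a stand-in for the stored fields of one index hit. */
    public static Map<String, String> storedFields(String url) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", url); // assumes the index stores a "url" field
        return fields;
    }

    /**
     * For each hit, reads the stored "url" field and asks the store for
     * the page content. Hits without a URL or without content are skipped.
     */
    public static List<String> contentsFor(List<Map<String, String>> hits,
                                           ContentStore store) {
        List<String> contents = new ArrayList<>();
        for (Map<String, String> hit : hits) {
            String url = hit.get("url");
            if (url == null) {
                continue; // nothing to look up for this hit
            }
            String content = store.contentFor(url);
            if (content != null) {
                contents.add(content);
            }
        }
        return contents;
    }
}
```

What I am missing is whatever would play the role of ContentStore on the
Nutch side.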

Any tip or suggestion will be most appreciated :)


--
"Minds are like parachutes, they work best when open."

Bruno Patini Furtado
Software Developer
webpage: www.bpfurtado.net
blog: http://www.livejournal.com/users/bpfurtado/
