I have made a small Lucene client that reads my Nutch index, created
with Nutch-0.9.
This works fine. However, since 'content' is only indexed and not stored
in the index, I have to find a way to access the content to create a
summary (and highlight the query terms).
I think I have found that the 'content' is cached and that I have to
retrieve the text from the segmentdb. Is there a way to retrieve this
content text from the segment and still be able to use Lucene as the
client?
I have debugged the Nutch code while running a query, and I found that
the class FetchedSegments has the following method:
    public Summary getSummary(HitDetails details, Query query)
        throws IOException {
      if (this.summarizer == null) { return new Summary(); }
      String text =
        getSegment(details).getParseText(getUrl(details)).getText();
      return this.summarizer.getSummary(text, query);
    }
Here the call getSegment(details).getParseText(getUrl(details)).getText()
seems to do the trick by returning the actual content, but this does
not seem like an easy thing for me to do from a Lucene perspective.
Is this the only way to access the content, or are there other ways as
well?
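To make clear what I want to do with the text once I have it, here is a
minimal sketch. It assumes the parse text is already available as a plain
String (getting that String is exactly my question). The class
SnippetSketch and its methods are hypothetical names of my own, not part
of Nutch or Lucene, and the <b> markup is just an example. I have kept it
free of generics so it should also compile under Java 1.4.

```java
// Hypothetical helper, not part of Nutch or Lucene: cuts a window of text
// around the first query-term hit and wraps every hit in <b>...</b>.
class SnippetSketch {

    // Length of the longest term matching at pos (case-insensitive), or 0.
    static int matchLength(String text, int pos, String[] terms) {
        int best = 0;
        for (int t = 0; t < terms.length; t++) {
            String term = terms[t];
            if (term.length() > best
                    && text.regionMatches(true, pos, term, 0, term.length())) {
                best = term.length();
            }
        }
        return best;
    }

    // Returns up to `context` characters on each side of the first hit,
    // with all hits inside that window highlighted. If no term occurs,
    // the first `context` characters of the text are returned unchanged.
    static String summarize(String text, String[] terms, int context) {
        int first = 0;
        while (first < text.length() && matchLength(text, first, terms) == 0) {
            first++;
        }
        if (first == text.length()) {
            first = 0; // no hit: fall back to the start of the text
        }
        int start = Math.max(0, first - context);
        int end = Math.min(text.length(), first + context);
        String window = text.substring(start, end);

        StringBuffer out = new StringBuffer();
        int i = 0;
        while (i < window.length()) {
            int len = matchLength(window, i, terms);
            if (len > 0) {
                out.append("<b>").append(window.substring(i, i + len)).append("</b>");
                i += len;
            } else {
                out.append(window.charAt(i));
                i++;
            }
        }
        return out.toString();
    }
}
```

I would call it as SnippetSketch.summarize(text, new String[] {"lucene"}, 10).
The window is cut at fixed character offsets, so it can start or end in
the middle of a word; a real summarizer would handle that better.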
Also, I am restricted to Java 1.4, so using the Nutch web client is not
an option.
Best regards,
Ronny