One approach is to add meta tags carrying the transcript to the web pages,
and then write a custom plugin for the HtmlParseFilter extension point. In
your HtmlParseFilter extension you can copy the meta tag content into the
parse data, and then implement IndexingFilter to add that content from the
parse data to the index.
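
To make the two stages concrete, here is a minimal sketch of the flow in
plain Java. It does not use the Nutch API: parseStage and indexStage are
hypothetical stand-ins for what an HtmlParseFilter and an IndexingFilter
would do, the maps stand in for the parse metadata and the index document,
and the meta tag name "transcript" is an assumption chosen for illustration.
The regex is also simplified and only handles name="..." followed by
content="...".

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the two plugin stages; NOT the Nutch API.
class TranscriptMetaSketch {

    // Simplified pattern: <meta name="..." content="..."> in that attribute order.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=\"([^\"]+)\"\\s+content=\"([^\"]*)\"\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    // Parse stage (the HtmlParseFilter role): collect meta tags into a map
    // that stands in for the parse metadata. Names are lower-cased.
    static Map<String, String> parseStage(String html) {
        Map<String, String> parseMeta = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            parseMeta.put(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return parseMeta;
    }

    // Index stage (the IndexingFilter role): copy the transcript from the
    // parse metadata into a map that stands in for the index document.
    static Map<String, String> indexStage(Map<String, String> parseMeta, String pageText) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("content", pageText);
        String transcript = parseMeta.get("transcript"); // assumed tag name
        if (transcript != null) {
            doc.put("transcript", transcript);
        }
        return doc;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta name=\"transcript\" content=\"Welcome to the product demo\">"
            + "</head><body>Video page</body></html>";
        Map<String, String> doc = indexStage(parseStage(html), "Video page");
        System.out.println(doc);
    }
}
```

In a real plugin the same two steps would live in separate classes registered
in the plugin descriptor, and the method signatures depend on your Nutch
version, so check the interfaces in your source tree before copying this
structure.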
- Sathyam
Chris Hane <[EMAIL PROTECTED]> wrote:
I am looking to use Nutch to crawl and index a website. A lot of the pages
have videos on them. We have transcripts for the videos that we would like
to be included for indexing, but we do not want to put the transcripts on
the web pages.
Is there a way to "add" this information to a given web page for purposes
of indexing as part of the crawl process? Maybe another point in the
process before the index is generated? I am hoping there is a point in the
crawl process where I can add augmented content to a page in the Nutch
segment (a rough thought based on very limited time spent looking at Nutch).
We are comfortable using Java and can write custom code as needed. I would
appreciate any pointers on where to look in the Nutch code.
Thanks in advance,
Chris.....
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general