I am looking to use Nutch to crawl and index a website. Many of the pages have videos on them. We have transcripts for the videos that we would like to include in the index, but we do not want to put the transcripts on the web pages themselves.
Is there a way to "add" this information to a given web page for indexing purposes as part of the crawl process, or perhaps at another point before the index is generated? I am hoping there is a point in the crawl process where I can add augmented content to a page in the Nutch segment (a rough thought, based on very limited time spent looking at Nutch). We are comfortable with Java and can write custom code as needed. I would appreciate any pointers on where to look in the Nutch code.

Thanks in advance,
Chris

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
