You can write an indexing plugin for this. Please first read the (slightly outdated) tutorial and then check http://wiki.apache.org/nutch/PluginCentral. Depending on the source of the data, you may also want to write an HTML parse plugin.
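To give a rough idea of what such an indexing plugin does, here is a minimal sketch. It is not the Nutch API: the real extension point is IndexingFilter, and its signature differs between releases, so please check the version you run. IndexDoc, TranscriptFilter, the "content" field name and the URL-to-transcript map are all stand-ins I made up for illustration. The idea is only that the filter sees each page's index document just before it is written and can append extra text, so the transcript never has to appear on the web page itself.

```java
import java.util.HashMap;
import java.util.Map;

class TranscriptIndexingSketch {

    /** Hypothetical stand-in for the index document handed to an indexing filter. */
    static final class IndexDoc {
        private final Map<String, StringBuilder> fields = new HashMap<>();

        /** Appends a value to a field; repeated values are separated by a space. */
        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new StringBuilder())
                  .append(value).append(' ');
        }

        /** Returns the accumulated text of a field, or "" if the field is absent. */
        String get(String name) {
            StringBuilder sb = fields.get(name);
            return sb == null ? "" : sb.toString().trim();
        }
    }

    /** Hypothetical filter: adds the transcript for a page URL, if one is known. */
    static final class TranscriptFilter {
        private final Map<String, String> transcriptsByUrl;

        TranscriptFilter(Map<String, String> transcriptsByUrl) {
            this.transcriptsByUrl = transcriptsByUrl;
        }

        /** Called once per page; leaves pages without a transcript unchanged. */
        IndexDoc filter(IndexDoc doc, String url) {
            String transcript = transcriptsByUrl.get(url);
            if (transcript != null) {
                doc.add("content", transcript); // assumed field name
            }
            return doc;
        }
    }

    public static void main(String[] args) {
        // Assumption: transcripts are available keyed by page URL.
        Map<String, String> transcripts = new HashMap<>();
        transcripts.put("http://example.com/video1.html", "Welcome to the demo");

        IndexDoc doc = new IndexDoc();
        doc.add("content", "Page text"); // what the parser extracted from the page

        new TranscriptFilter(transcripts).filter(doc, "http://example.com/video1.html");
        System.out.println(doc.get("content"));
    }
}
```

In a real plugin the transcript lookup would probably read from a file or database keyed by URL, and the plugin would be registered through its plugin.xml as described on the PluginCentral page.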

Chris Hane wrote:
> I am looking to use nutch to crawl/index a website. A lot of the
> pages have videos on them. We have transcripts for the videos that we
> would like to be included for indexing; but we do not want to put the
> transcripts on the web pages.
>
> Is there a way to "add" this information to a given web page for
> purposes of indexing as part of the crawl process? Maybe another
> point in the process before the index is generated? I am hoping there
> is a point in the crawl process where I can add augmented content to a
> page in the nutch segment (rough thought based on very limited time
> spent looking at nutch).
>
> We are comfortable using java and can write custom code as needed. I
> would appreciate any pointers on where to look in the nutch code.
>
> Thanks in advance,
> Chris.....

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
