If I am not wrong, the segments produced by the Generator contain some sort of CrawlDatum entries. I store metadata in the CrawlDb (information that never changes), and I think it is copied into the segments by the Generator.
But now I want to access that metadata at the parsing or indexing step, to put some of it into the ParseData that was extracted (or directly into the index). I can't find a way to reassociate the "Content" and the Parse object with their respective CrawlDb/segment entries. Basically, I am trying to use the CrawlDb as a database of metadata for every URL, and I want to use that metadata at the indexing step to enrich the ParseData so that I can search against it later on.

Stupid example: I know this URL is associated with the color "blue", but that information is not in the page the URL points to. "blue" would be kept in the CrawlDb metadata, then the generate/fetch/parse steps are done as usual, but when indexing, "blue" should be reassociated with the ParseData that was extracted from the page.

Is this feasible without changing anything in Nutch? (I use Nutch more or less as a library and avoid changing things in it; I prefer rewriting my own injector/generator/fetcher/parser, formats, etc. if needed.)

I am going through the different classes in Nutch/Hadoop now to understand where things are stored, whether they are read, and into what kind of object they are put. Any pointer to shorten my reading is very welcome ;)

Thanks!
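
P.S. To make the "blue" example concrete, here is a rough sketch of the behaviour I am after. It uses plain Java maps as stand-ins: the class name, the `enrich` method and the field names are all made up for illustration, and none of it is Nutch API. It only shows the join by URL that I would like to happen at indexing time.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins only: plain maps, not Nutch's CrawlDb or ParseData.
final class MetadataJoinSketch {

    // Build the document to index for one URL: start from the metadata stored
    // for that URL in the "crawl db" map, then add the fields parsed from the
    // page. Parsed fields win on key conflicts, so page content is never lost.
    static Map<String, String> enrich(String url,
                                      Map<String, Map<String, String>> crawlDbMeta,
                                      Map<String, String> parsedFields) {
        Map<String, String> stored =
                crawlDbMeta.getOrDefault(url, Collections.<String, String>emptyMap());
        Map<String, String> doc = new HashMap<>(stored);
        doc.putAll(parsedFields);
        doc.put("url", url);
        return doc;
    }

    public static void main(String[] args) {
        // Metadata I inject myself; it is not present in the fetched page.
        Map<String, String> meta = new HashMap<>();
        meta.put("color", "blue");
        Map<String, Map<String, String>> crawlDbMeta = new HashMap<>();
        crawlDbMeta.put("http://example.com/page", meta);

        // Fields extracted from the page by the parse step.
        Map<String, String> parsed = new HashMap<>();
        parsed.put("title", "Some page");

        Map<String, String> doc =
                enrich("http://example.com/page", crawlDbMeta, parsed);
        System.out.println(doc);
    }
}
```

The indexed document would then carry both the parsed title and the "color" field, so a later search on color could find the page even though the page itself never mentions it.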