If I am not mistaken, the segments generated by the Generator hold some
sort of CrawlDatum entries.
I am putting metadata in the CrawlDb (information that never changes),
and I think it is copied to the segments by the Generator.

But now I want to access that metadata at the parsing or indexing step
to put some of it in the ParseData that was extracted (or directly in
the index).

I can't find a way to reassociate the Content and the Parse object with
their respective CrawlDb/segment entry.

Basically, I am trying to use the CrawlDb as a database of metadata for
every URL, and I want to use that metadata at the indexing step to enrich
the ParseData and then be able to search against it later on.

Stupid example: I know this URL is associated with the color "blue", but
that information does not appear in the page the URL points to. "blue"
would be kept in the metadata of the CrawlDb, then the
generate/fetch/parse steps are done as usual, but when indexing, "blue"
should be reassociated with the ParseData that has been extracted from
the page.
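
To make the example concrete, here is a rough sketch of the join I have
in mind, written in plain Java. Everything in it is hypothetical: the
maps only stand in for the CrawlDb metadata and the parsed fields, and
none of the names are Nutch classes or Nutch API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch only; the maps are hypothetical stand-ins for the
// CrawlDb metadata and the parse output, not Nutch classes.
class MetadataJoinSketch {

    // Builds the document to index for one URL: start from the metadata
    // stored for that URL, then add the fields extracted from the page.
    // Parsed fields win on key collisions, so page content is never
    // overwritten by stored metadata.
    static Map<String, String> enrich(String url,
                                      Map<String, Map<String, String>> crawlDbMeta,
                                      Map<String, String> parsedFields) {
        Map<String, String> stored =
                crawlDbMeta.getOrDefault(url, Collections.<String, String>emptyMap());
        Map<String, String> doc = new HashMap<>(stored);
        doc.putAll(parsedFields);
        return doc;
    }

    public static void main(String[] args) {
        // Metadata I know about the URL but that is not in the page itself.
        Map<String, Map<String, String>> crawlDbMeta = new HashMap<>();
        Map<String, String> meta = new HashMap<>();
        meta.put("color", "blue");
        crawlDbMeta.put("http://example.com/page", meta);

        // Fields that the parse step extracted from the fetched page.
        Map<String, String> parsed = new HashMap<>();
        parsed.put("title", "Some page");

        Map<String, String> doc =
                enrich("http://example.com/page", crawlDbMeta, parsed);
        System.out.println(doc);
    }
}
```

The URL is the only key shared by both sides, which is why I am looking
for the place where Nutch still has the CrawlDb entry and the parse
result for the same URL at hand.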

Is it feasible without changing anything in Nutch? (I use Nutch more or
less as a library and avoid changing things in it; I would rather write
my own injector/generator/fetcher/parser, formats, etc. if needed.)

I am going through all the different classes in Nutch/Hadoop now to
understand where things are, whether they are read, and into what kind
of object they are put.
Any pointers to shorten my reading are very welcome ;)

Thanks!
