I think at the parser plugin level you can't get back to the original
CrawlDatum; the parsers only get the Content.
What I did was put data from the CrawlDb into the Content metadata at
fetch time. The parser then receives this metadata and can copy it into
the Parse object as needed.
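
Roughly like this (just a sketch against 0.8-style APIs; the "color" key
is a made-up placeholder, and whether the string Writable is UTF8 or
Text depends on your Hadoop version):

  // At fetch time: copy a CrawlDatum metadata entry into the Content
  // metadata so that it survives into the parse step.
  Writable color = datum.getMetaData().get(new UTF8("color"));
  if (color != null)
    content.getMetadata().set("color", color.toString());

  // In the parser plugin: read it back from the Content metadata and
  // expose it in the ParseData for indexing filters to pick up.
  String colorValue = content.getMetadata().get("color");
  if (colorValue != null)
    parse.getData().getParseMeta().set("color", colorValue);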

If you do fetching and parsing in a single shot, the Fetcher class could
put the info from the CrawlDatum directly into the parse.
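
In that case something like this should do (again only a sketch, same
assumptions as above):

  // In a parsing fetcher, after the parse succeeds: write the
  // CrawlDatum metadata straight into the ParseData, skipping the
  // round-trip through the Content metadata.
  Writable color = datum.getMetaData().get(new UTF8("color"));
  if (color != null)
    parse.getData().getParseMeta().set("color", color.toString());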


-----Original Message-----
From: Enis Soztutar [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 30, 2006 1:07 AM
To: nutch-dev@lucene.apache.org
Subject: Re: Use CrawlDb as a metadata Db?

HUYLEBROECK Jeremy RD-ILAB-SSF wrote:
> If I am not wrong, the segments generated by the Generator are some
> sort of CrawlDatum.
> I am putting metadata in the CrawlDb (I keep information that never
> changes) and I think it is copied to the segments by the Generator.
>
> But now I want to access that metadata at the parsing or indexing
> step, to put some of it into the extracted ParseData (or directly
> into the index).
>
> I can't find a way to reassociate the "Content" and the Parse object
> with their respective CrawlDb entries/segments.
>
> Basically, I am trying to use the CrawlDb as a database of metadata
> for every URL, and I want to use it at the indexing step to enrich
> the ParseData and then be able to search against it later on.
>
> Stupid example: I know this URL is associated with the color "blue",
> but this information is not in the page the URL points to. "Blue"
> would be kept in the CrawlDb metadata, then the generate/fetch/parse
> steps run as usual, but at indexing time "blue" should be
> reassociated with the ParseData extracted from the page.
>
> Is it feasible without changing anything in Nutch? (I use Nutch more
> or less as a library and avoid changing stuff in it; I prefer redoing
> my own injector/generator/fetcher/parser and formats etc. if needed.)
>
> I am going through all the different classes in Nutch/Hadoop now to
> understand where things are, whether they are read, and what kind of
> objects they end up in.
> Any pointer that shortens my reading is very welcome ;)
>
> Thanks!
>
>
>   
hi,

The CrawlDatum keeps crawl status information about every url that is
fetched. The class has a metadata field which is an instance of
MapWritable, behaving much like a HashMap, so I have used the metadata
field for similar purposes. For example, in the fetcher you can set a
property like:

datum.getMetaData().put(<key>, <value>);

and then in the indexing plugin you can retrieve it with:

datum.getMetaData().get(<key>);
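
To make that concrete: keys and values in the MapWritable must
themselves be Writable, so with a made-up "color" key (and 0.8-era
Nutch/Lucene APIs; double-check the string Writable class and the
IndexingFilter signature for your version) it would look roughly like:

  // when writing, e.g. in the fetcher:
  datum.getMetaData().put(new UTF8("color"), new UTF8("blue"));

  // when reading, inside an IndexingFilter implementation, where doc
  // is the Lucene Document being built and datum is the CrawlDatum:
  Writable color = datum.getMetaData().get(new UTF8("color"));
  if (color != null)
    doc.add(new Field("color", color.toString(),
                      Field.Store.YES, Field.Index.UN_TOKENIZED));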




