Hi,

Check the db.parsemeta.to.crawldb parameter. It'll copy your parse metadata keys into the CrawlDatum metadata.
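
As a minimal sketch, assuming the key is named LangId as in the quoted message below and that your Nutch version supports this property, the override in conf/nutch-site.xml might look like this:

```xml
<!-- Goes inside the <configuration> element of conf/nutch-site.xml. -->
<!-- "LangId" is the parse metadata key from the quoted message; use your own key name. -->
<property>
  <name>db.parsemeta.to.crawldb</name>
  <value>LangId</value>
</property>
```

As far as I know the value is a comma-separated list of keys, and the values should only show up in CrawlDatum.getMetadata() after the segment has been parsed and the CrawlDb updated again.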

Cheers

-----Original message-----
> From: Safdar Kureishy <safdar.kurei...@gmail.com>
> Sent: Wed 29-Aug-2012 21:26
> To: user@nutch.apache.org
> Subject: Need to transfer Parse metadata obtained in HtmlParseFilter.filter()
> to the CrawlDb
>
> Hi,
>
> I've built a custom HtmlParseFilter and am doing custom language
> identification in the filter() API. Here, I am able to set the relevant
> lang id properties on a ParseResult object via getParseMeta().put("LangId",
> id). I am also able to retrieve these properties in my custom
> ScoringFilter, for use during distributeScoreToOutlinks(). However, what I
> also need is to persist this data as metadata in the relevant CrawlDb
> record (i.e., in the CrawlDatum.getMetadata() data structure). My intent,
> from all this, is finally to be able to write custom Hadoop jobs to gather
> language distribution statistics directly from the CrawlDb (without having
> to do any joins on the ParseText, Content, ParseData types). The only way I
> see this being possible is if each URL's CrawlDatum also has the lang-id
> in its metadata.
>
> This is turning out to be a challenge. I first tried transferring the parse
> properties in my custom ScoringFilter.distributeScoreToOutlinks() API,
> because that API offers access to the ParseResult as well as an "adjust"
> CrawlDatum parameter for updating the CrawlDb (according to the Javadocs).
> However, doing that is not updating the CrawlDb. Then, in the newsgroup
> archives, I stumbled upon a thread about the
> "db.max.outlinks.per.page" property being used by the ParseOutputFormat
> class to do exactly the same property transfer at a different stage of the
> crawl cycle, but that doesn't work either.
>
> So, I'm writing to the newsgroup hoping someone could give me specific
> advice on which API I should override, or which configuration setting I
> should change, so as to transfer custom parse-time metadata to the CrawlDb.
>
> Thanks in advance.
>
> Cheers,
> Safdar