Hi,

I've built a custom HtmlParseFilter that does custom language identification in its filter() API. There, I am able to set the relevant language-ID properties on a ParseResult object via getParseMeta().put("LangId", id). I am also able to retrieve these properties in my custom ScoringFilter, for use during distributeScoreToOutlinks().

However, I also need to persist this data as metadata in the relevant CrawlDb record (i.e., in the CrawlDatum.getMetadata() data structure). My eventual goal is to write custom Hadoop jobs that gather language distribution statistics directly from the CrawlDb (without having to do any joins on the ParseText, Content, or ParseData types). The only way I see this being possible is if each URL's CrawlDatum also has the language ID in its metadata.
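
To make the goal concrete, here is a minimal sketch of the kind of aggregation I have in mind. It deliberately uses plain Java maps as a stand-in for each record's CrawlDatum metadata, not the actual Nutch or Hadoop classes, so the class name LangStats, the method countByLanguage, and the "unknown" bucket are all hypothetical; only the "LangId" key comes from my filter.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a job over the CrawlDb: each Map<String, String>
// plays the role of one record's CrawlDatum metadata. No Nutch or Hadoop
// classes are used here; this only illustrates the counting step.
class LangStats {

    // Metadata key set by the custom HtmlParseFilter.
    static final String LANG_KEY = "LangId";

    // Counts how many records carry each language ID. Records without the
    // key are grouped under "unknown" (an assumption made for this sketch).
    static Map<String, Long> countByLanguage(List<Map<String, String>> records) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map<String, String> metadata : records) {
            String lang = metadata.getOrDefault(LANG_KEY, "unknown");
            counts.merge(lang, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = List.of(
            Map.of(LANG_KEY, "en"),
            Map.of(LANG_KEY, "de"),
            Map.of(LANG_KEY, "en"),
            Map.of()); // a record whose metadata has no language ID
        System.out.println(countByLanguage(records));
    }
}
```

In a real job the per-record lookup would presumably read from CrawlDatum.getMetadata() in a mapper and the summing would happen in a reducer, but that only works once the language ID actually reaches the CrawlDb.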
This is turning out to be a challenge. I first tried transferring the parse properties in my custom ScoringFilter.distributeScoreToOutlinks() API, because that API offers access to the ParseResult as well as an "adjust" CrawlDatum parameter for updating the CrawlDb (according to the Javadocs). However, doing that does not update the CrawlDb. Then, in the newsgroup archives, I stumbled upon a thread about the "db.parsemeta.to.crawldb" property being used by the ParseOutputFormat class to do exactly the same property transfer at a different stage of the crawl cycle, but that doesn't work either.

So, I'm writing to the newsgroup hoping someone could give me specific advice on which API I should override, or which configuration setting I should change, so as to transfer custom parse-time metadata to the CrawlDb.

Thanks in advance.

Cheers,
Safdar