Hi

Check the db.parsemeta.to.crawldb parameter. It takes a comma-separated list 
of parse metadata keys and copies each listed key, with its value, into the 
CrawlDatum metadata.
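
As a rough sketch, assuming Nutch 1.x and that you keep your overrides in 
conf/nutch-site.xml (the file location is my assumption, the key name is taken 
from your mail), the setting could look something like this:

```xml
<!-- Sketch only: goes inside the <configuration> element of nutch-site.xml. -->
<property>
  <name>db.parsemeta.to.crawldb</name>
  <!-- Comma-separated parse metadata keys; "LangId" is the key set in the
       custom HtmlParseFilter via getParseMeta().put("LangId", id). -->
  <value>LangId</value>
</property>
```

The key has to match the name you put into the parse metadata exactly. As far 
as I know the copy is prepared when a segment is parsed and merged in on the 
next updatedb, so segments parsed before the change would probably not carry 
the value.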

Cheers

 
 
-----Original message-----
> From:Safdar Kureishy <safdar.kurei...@gmail.com>
> Sent: Wed 29-Aug-2012 21:26
> To: user@nutch.apache.org
> Subject: Need to transfer Parse metadata obtained in HtmlParseFilter.filter() 
> to the CrawlDb
> 
> Hi,
> 
> I've built a custom HtmlParseFilter and am doing custom language
> identification in the filter() API. Here, I am able to set the relevant
> lang id properties on a ParseResult object via getParseMeta().put("LangId",
> id). I am also able to retrieve these properties in my custom
> ScoringFilter, for use during distributeScoreToOutlinks(). However, what I
> also need is to persist this data as metadata in the relevant CrawlDb
> record (i.e., in the CrawlDatum.getMetadata() data structure). My intent,
> from all this, is finally to be able to write custom Hadoop jobs to gather
> language distribution statistics directly from the CrawlDb (without having
> to do any joins on the ParseText, Content, ParseData types). The only way I
> see this being possible is if each URL's CrawlDatum also has the lang-id
> in its metadata.
> 
> This is turning out to be a challenge. I first tried transferring the parse
> properties in my custom ScoringFilter.distributeScoreToOutlinks() API,
> because that API offers access to the ParseResult as well as an "adjust"
> CrawlDatum parameter for updating the CrawlDb (according to the Javadocs).
> However, doing that is not updating the CrawlDb. Then, in the newsgroup
> archives, I stumbled upon a thread about the
> "db.max.outlinks.per.page" property being used by the ParseOutputFormat
> class to do exactly the same property transfer at a different stage of the
> crawl cycle, but that doesn't work either.
> 
> So, I'm writing to the newsgroup hoping someone could give me specific
> advice on which API I should override, or which configuration setting I
> should change, so as to transfer custom parse-time metadata to the CrawlDb.
> 
> Thanks in advance.
> 
> Cheers,
> Safdar
> 
