Hi,

I've built a custom HtmlParseFilter and am doing custom language
identification in the filter() API. Here, I am able to set the relevant
lang id properties on a ParseResult object via getParseMeta().put("LangId",
id). I am also able to retrieve these properties in my custom
ScoringFilter, for use during distributeScoreToOutlinks(). However, what I
also need is to persist this data as metadata in the relevant CrawlDb
record (i.e., in the CrawlDatum.getMetadata() data structure). My goal is
to write custom Hadoop jobs that gather language distribution statistics
directly from the CrawlDb (without having to do any joins on the ParseText,
Content, or ParseData types). The only way I see this being possible is if
each URL's CrawlDatum also has the lang-id in its metadata.
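
To make the end goal concrete, here is a minimal sketch of the aggregation
I have in mind. It uses plain java.util maps as stand-ins for CrawlDb
records; the class LangIdStats and the method countByLanguage are
hypothetical names for illustration, not Nutch APIs. A real Hadoop job
would presumably read the URL/CrawlDatum pairs from the CrawlDb and look
the key up in CrawlDatum.getMetadata() instead.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a job over the CrawlDb: each entry maps a URL
// to its metadata, modelled here as a plain Map<String, String>.
class LangIdStats {
    // Same key that the parse filter stores in the parse metadata.
    static final String LANG_KEY = "LangId";

    // Counts URLs per language id; records without the key are counted
    // under "unknown". TreeMap keeps the result sorted by language id.
    static Map<String, Integer> countByLanguage(
            Map<String, Map<String, String>> crawlDb) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> metadata : crawlDb.values()) {
            String lang = metadata.getOrDefault(LANG_KEY, "unknown");
            counts.merge(lang, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample records, purely for illustration.
        Map<String, Map<String, String>> crawlDb = Map.of(
                "http://a.example/", Map.of(LANG_KEY, "en"),
                "http://b.example/", Map.of(LANG_KEY, "fr"),
                "http://c.example/", Map.of(LANG_KEY, "en"),
                "http://d.example/", Map.of());
        System.out.println(countByLanguage(crawlDb));
        // prints {en=2, fr=1, unknown=1}
    }
}
```

The point of the sketch is that the count needs nothing beyond the
per-URL metadata, which is why I want the lang-id in the CrawlDatum itself.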

This is turning out to be a challenge. I first tried transferring the parse
properties in my custom ScoringFilter.distributeScoreToOutlinks() API,
because that API offers access to the ParseResult as well as an "adjust"
CrawlDatum parameter for updating the CrawlDb (according to the Javadocs).
However, doing that does not update the CrawlDb. Then, in the newsgroup
archives, I stumbled upon a thread about the "db.parsemeta.to.crawldb"
property being used by the ParseOutputFormat class to do exactly the same
property transfer at a different stage of the crawl cycle, but that doesn't
work either.

So, I'm writing to the newsgroup hoping someone could give me specific
advice on which API I should override, or which configuration setting I
should change, so as to transfer custom parse-time metadata to the CrawlDb.

Thanks in advance.

Cheers,
Safdar
