Andrzej,
Based on what you suggested below, I have begun writing my own
scoring plugin. In distributeScoreToOutlinks(), if a link contains the
string I'm looking for, I set its score to kept_score and add a flag
("KEEP", "true") to the metaData in parseData. How do I check for this
flag in generatorSortValue()? I only see a way to check the score, not
a flag.
Thanks,
Eric
On Oct 7, 2009, at 2:48 AM, Andrzej Bialecki wrote:
Eric Osgood wrote:
Andrzej,
How would I check for a flag during fetch?
You would check for a flag during generation - please check
ScoringFilter.generatorSortValue(), that's where you can check for a
flag and set the sort value to Float.MIN_VALUE - this way the link
will never be selected for fetching.
And you would put the flag in CrawlDatum metadata when
ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks().
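
The hand-off described above (set a flag while distributing scores, read it back at generation time) can be sketched in plain Java. This is only a model under stated assumptions, not Nutch code: the class KeepFlagSketch, the methods tagOutlink() and sortValue(), and the String map are hypothetical stand-ins. In a real plugin the flag would live in the CrawlDatum metadata mentioned above, which as far as I know is a Hadoop MapWritable keyed by Writable objects. One Java detail worth checking against your own scores: Float.MIN_VALUE is the smallest positive float, not the most negative one, so it still sorts above zero and above any negative value.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of the flag hand-off; hypothetical, not the Nutch API. */
final class KeepFlagSketch {
    // Flag name and value taken from this thread.
    static final String KEEP_KEY = "KEEP";
    static final String KEEP_VALUE = "true";

    // Sort value suggested in this thread for unwanted links.
    // Note: Float.MIN_VALUE is the smallest POSITIVE float (about 1.4e-45).
    static final float UNWANTED_SORT = Float.MIN_VALUE;

    /** Stand-in for the tagging step in distributeScoreToOutlinks(). */
    static Map<String, String> tagOutlink(String url, String wanted) {
        Map<String, String> meta = new HashMap<>();
        if (url.contains(wanted)) {
            meta.put(KEEP_KEY, KEEP_VALUE); // mark the link as one to keep
        }
        return meta;
    }

    /** Stand-in for generatorSortValue(): flagged links keep their sort value. */
    static float sortValue(Map<String, String> meta, float initSort) {
        if (KEEP_VALUE.equals(meta.get(KEEP_KEY))) {
            return initSort;
        }
        return UNWANTED_SORT; // unflagged links sink toward the bottom
    }
}
```

For example, KeepFlagSketch.sortValue(KeepFlagSketch.tagOutlink(url, "products"), 2.5f) returns 2.5f when url contains "products" and UNWANTED_SORT otherwise.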
Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page while
still keeping a total of X links per page. If I find the links I
want, I add them to the list up to X; if I don't reach X, I add
other links until X is reached. This way, I don't waste crawl time
on non-relevant links.
You can modify the collection of target links passed to
distributeScoreToOutlinks() - this way you can affect both which
links are stored and what kind of metadata each of them gets.
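
The "wanted links first, then fill up to X" selection described above could look roughly like the following. It is a sketch over plain URL strings: OutlinkQuotaSketch, selectOutlinks(), and the substring test for "wanted" are hypothetical, and a real plugin would apply the same idea to the collection of targets handed to distributeScoreToOutlinks().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: keep at most 'limit' outlinks, preferred ones first. */
final class OutlinkQuotaSketch {

    static List<String> selectOutlinks(List<String> outlinks, String wanted, int limit) {
        List<String> selected = new ArrayList<>();

        // First pass: take the links that contain the wanted string.
        for (String url : outlinks) {
            if (selected.size() >= limit) {
                break;
            }
            if (url.contains(wanted)) {
                selected.add(url);
            }
        }

        // Second pass: if the quota is not reached, fill up with other links.
        for (String url : outlinks) {
            if (selected.size() >= limit) {
                break;
            }
            if (!url.contains(wanted)) {
                selected.add(url);
            }
        }
        return selected;
    }
}
```

Both passes preserve the original link order within their group, and a limit of zero or less yields an empty list.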
As I said, you can also use plain URLFilters to filter out
unwanted links, but that API gives you much less control because
it's a simple yes/no that considers just the URL string. The advantage
is that it's much easier to implement than a ScoringFilter.
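
To illustrate how little such a yes/no filter has to do, here is a minimal sketch. It mirrors the contract I understand Nutch's URLFilter to have (return the URL to accept it, null to reject it), but the class SubstringUrlFilterSketch is hypothetical and does not implement the real interface or its plugin wiring.

```java
/** Hypothetical yes/no filter that looks only at the URL string. */
final class SubstringUrlFilterSketch {
    private final String wanted;

    SubstringUrlFilterSketch(String wanted) {
        this.wanted = wanted;
    }

    /** Returns the URL unchanged to accept it, or null to reject it. */
    String filter(String urlString) {
        if (urlString != null && urlString.contains(wanted)) {
            return urlString;
        }
        return null; // no score, no metadata: the link is simply dropped
    }
}
```

Compared with the scoring approach, a filter like this cannot fall back to other links when too few wanted ones are found, because it sees each URL in isolation.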
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosg...@calpoly.edu, e...@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com