Andrzej,

Based on what you suggested below, I have begun to write my own scoring plugin:

In distributeScoreToOutlinks(), if the link contains the string I'm looking for, I set its score to kept_score and add a flag ("KEEP", "true") to the metaData in parseData. How do I check for this flag in generatorSortValue()? I only see a way to check the score, not a flag.

Thanks,

Eric


On Oct 7, 2009, at 2:48 AM, Andrzej Bialecki wrote:

Eric Osgood wrote:
Andrzej,
How would I check for a flag during fetch?

You would check for a flag during generation - please check ScoringFilter.generatorSortValue(), that's where you can check for a flag and set the sort value to Float.MIN_VALUE - this way the link will never be selected for fetching.

And you would put the flag in CrawlDatum metadata when ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks().
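
A rough sketch of that flow, for illustration only. It uses a simplified stand-in class (SketchDatum) instead of the real Nutch CrawlDatum, and plain String metadata instead of Nutch's Writable-based metadata map, so that it compiles on its own. The names SketchDatum, KeepFlagSketch, markTarget and the value of KEPT_SCORE are assumptions made up for this example. The point it shows: the flag has to live on the target link's datum, not on parseData, because the datum is what generatorSortValue() gets to see.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for Nutch's CrawlDatum (assumption): a score plus string metadata.
class SketchDatum {
    float score = 1.0f;
    final Map<String, String> metaData = new HashMap<>();
}

class KeepFlagSketch {
    static final String KEEP_KEY = "KEEP";   // flag name used in this thread
    static final float KEPT_SCORE = 10.0f;   // hypothetical value for kept_score

    // Plays the role of distributeScoreToOutlinks() for one target link:
    // the flag is stored on the target's own datum.
    static void markTarget(String targetUrl, SketchDatum target, String wanted) {
        if (targetUrl.contains(wanted)) {
            target.score = KEPT_SCORE;
            target.metaData.put(KEEP_KEY, "true");
        }
    }

    // Plays the role of generatorSortValue(): flagged links keep a normal
    // sort value, everything else gets Float.MIN_VALUE as suggested above.
    // Note that in Java, Float.MIN_VALUE is the smallest positive float,
    // so unflagged links end up at the bottom of the sort order.
    static float generatorSortValue(String url, SketchDatum datum, float initSort) {
        if ("true".equals(datum.metaData.get(KEEP_KEY))) {
            return datum.score * initSort;
        }
        return Float.MIN_VALUE;
    }
}
```

In a real plugin the same two steps would go into the ScoringFilter implementation, with the key and value wrapped in whatever types the CrawlDatum metadata map expects.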

Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page while still keeping a total of X links per page: if I find the links I want, I add them to the list up to X, and if I don't reach X, I add other links until X is reached. This way, I don't waste crawl time on non-relevant links.

You can modify the collection of target links passed to distributeScoreToOutlinks() - this way you can affect both which links are stored and what kind of metadata each of them gets.
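
The "wanted links first, then fill up to X" selection described above could look roughly like this. This is a sketch under assumptions, not Nutch code: the targets are modelled as a plain List<String> of URLs (in Nutch the collection holds URL/CrawlDatum entries), and the names OutlinkSelectionSketch, selectTargets, wanted and maxLinks are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class OutlinkSelectionSketch {
    // Returns at most maxLinks URLs: first the ones containing the wanted
    // string, then other links until the limit X (maxLinks) is reached.
    static List<String> selectTargets(List<String> targets, String wanted, int maxLinks) {
        List<String> selected = new ArrayList<>();

        // First pass: relevant links, in their original order.
        for (String url : targets) {
            if (selected.size() >= maxLinks) break;
            if (url.contains(wanted)) selected.add(url);
        }

        // Second pass: fill the remaining slots with the other links.
        for (String url : targets) {
            if (selected.size() >= maxLinks) break;
            if (!url.contains(wanted)) selected.add(url);
        }
        return selected;
    }
}
```

Inside distributeScoreToOutlinks() the equivalent step would be to remove from the targets collection every entry that did not make it into the selected list.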

As I said, you can also use just plain URLFilters to filter out unwanted links, but that API gives you much less control because it's a simple yes/no decision that considers just the URL string. The advantage is that it's much easier to implement than a ScoringFilter.
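
For comparison, a URL-only filter can be sketched in a few lines. The interface below (SketchUrlFilter) is a local stand-in defined just for this example, modelled on the assumption that a Nutch URLFilter receives the URL string and returns it to accept the link or null to reject it; SubstringUrlFilterSketch and the wanted parameter are hypothetical names.

```java
// Local stand-in for the filter contract (assumption): return the URL to
// accept it, or null to reject it.
interface SketchUrlFilter {
    String filter(String urlString);
}

// Accepts only URLs that contain the wanted substring. It sees nothing but
// the URL string, so it cannot count links per page or attach metadata.
class SubstringUrlFilterSketch implements SketchUrlFilter {
    private final String wanted;

    SubstringUrlFilterSketch(String wanted) {
        this.wanted = wanted;
    }

    @Override
    public String filter(String urlString) {
        return urlString.contains(wanted) ? urlString : null;
    }
}
```

This is the trade-off mentioned above: very little code, but no way to implement a rule like "fill up to X links per page", since each URL is judged in isolation.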


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosg...@calpoly.edu, e...@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com
