Hi,

On 2/17/07, Brian Whitman <[EMAIL PROTECTED]> wrote:
>
> I'm not worried about a hack, our whole set up is very "der lauf der
> dinge" and one more plank won't matter much :) But after sending my
> question out, I realized that I would need to index the document
> anyway before being able to lucene query it for topicality. I don't
> mind having pages stored that don't match my query, but I really
> would rather the generator not get more outlinks from those pages.

How about an outlink filter that runs during parse? Called from
ParseOutputFormat, it would take the parse text, parse data (etc.) of the
source page plus the destination URL, and return either "filter this
outlink" or "let it through".
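
To make the idea concrete, here is a minimal sketch of what such a filter
could look like. Note that OutlinkFilter, TopicalOutlinkFilter and
OutlinkFiltering are hypothetical names invented for illustration; they are
not existing Nutch classes or extension points, and the plain keyword check
only stands in for whatever topicality test you actually want to run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface (not part of Nutch): decides, per outlink,
// whether the link should be kept, given what we know about the source page.
interface OutlinkFilter {
    boolean accept(String parseText, Map<String, String> parseMeta, String toUrl);
}

// Assumed example policy: keep outlinks only if the source page's text
// mentions a keyword. A real topicality test would go here instead.
final class TopicalOutlinkFilter implements OutlinkFilter {
    private final String keyword;

    TopicalOutlinkFilter(String keyword) {
        this.keyword = keyword.toLowerCase(Locale.ROOT);
    }

    @Override
    public boolean accept(String parseText, Map<String, String> parseMeta, String toUrl) {
        return parseText != null
                && parseText.toLowerCase(Locale.ROOT).contains(keyword);
    }
}

// Stand-in for the place in the parse output step where outlinks are written.
final class OutlinkFiltering {
    static List<String> filterOutlinks(OutlinkFilter filter,
                                       String parseText,
                                       Map<String, String> parseMeta,
                                       List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String toUrl : outlinks) {
            if (filter.accept(parseText, parseMeta, toUrl)) {
                kept.add(toUrl); // "let it through"
            }
            // otherwise: "filter this outlink"
        }
        return kept;
    }
}
```

With something like this, an off-topic page is still stored and indexed,
but none of its outlinks reach the generator.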

>
> So a simple fix would be something I can write or run after a crawl/
> index cycle that can mark certain pages to not emit more URIs in the
> generator. It would query each page in an index and update some flag.
> But what is that flag and how can I get at it?
>
> And more advanced and later on -- the generator has smarts to
> prioritize fetching by inlink counts-- is there something I can hack
> to "boost" outlink fetches based on the source page's content?  for
> example - I find a page that scores high on my lucene query after
> crawl/index gets done. I would want the generator to put all of its
> outlinks up top, even if there's not many inlinks to that page...
> would this be a "generator plugin?"

You should be able to do this with a scoring plugin and a parse plugin.

Write a parse plugin (or update an existing one) to analyze the content
and put the result in the parse data's metadata (for example, a
<"boost", "10"> pair). Then, in
<your_scoring_filter>.distributeScoreToOutlink, check whether the parse
data's metadata has the "boost" field and boost the outlink's score
accordingly. You may also want to consider changing the indexerScore
method to give such pages an even higher boost.
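
Roughly, the two halves could look like the sketch below. This is plain
Java and does not use the real Nutch signatures: BoostScoring, markBoost
and outlinkScore are hypothetical names, the metadata is modelled as a
simple Map, and the keyword match is only a placeholder for your own
content analysis. The actual logic would live in your parse plugin and in
your scoring filter's distributeScoreToOutlink.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class BoostScoring {
    // Metadata key shared by the parse side and the scoring side.
    static final String BOOST_KEY = "boost";

    // Parse-side step (hypothetical helper): if the page looks on-topic,
    // record a boost value in the parse metadata.
    static void markBoost(Map<String, String> parseMeta,
                          String text, String keyword, float boost) {
        if (text != null
                && text.toLowerCase(Locale.ROOT)
                       .contains(keyword.toLowerCase(Locale.ROOT))) {
            parseMeta.put(BOOST_KEY, Float.toString(boost));
        }
    }

    // Scoring-side step (hypothetical helper): multiply the score handed
    // to an outlink by the boost, if one was recorded for the source page.
    static float outlinkScore(Map<String, String> parseMeta, float baseScore) {
        String raw = parseMeta.get(BOOST_KEY);
        if (raw == null) {
            return baseScore; // no boost recorded, leave the score alone
        }
        try {
            return baseScore * Float.parseFloat(raw);
        } catch (NumberFormatException e) {
            return baseScore; // ignore a malformed boost value
        }
    }
}
```

The same metadata lookup could be reused in indexerScore if you also want
the boosted pages to rank higher in the index.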

>
> -Brian
>


-- 
Doğacan Güney
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
