On Feb 17, 2007, at 12:58 PM, Dennis Kubes wrote:

> You can use an HtmlParseFilter and then set a metadata attribute as  
> to whether or not it contains the phrase.  Problem with this is  
> that all of the content is still stored.  You could also change the  
> ParseOutputFormat to only write out if the word is contained  
> although that is a bit of a hack.

I'm not worried about a hack; our whole setup is very "der lauf der
dinge" and one more plank won't matter much :) But after sending my
question out, I realized that I would need to index the document
anyway before I could run a Lucene query against it for topicality. I
don't mind having pages stored that don't match my query, but I would
really rather the generator not get more outlinks from those pages.

So a simple fix would be something I can write or run after a
crawl/index cycle that marks certain pages so they don't feed any more
URIs to the generator. It would query each page in an index and update
some flag. But what is that flag, and how can I get at it?
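
To make the idea concrete, here is a minimal sketch of the pass I have in
mind. It does not use any Nutch classes: OutlinkGate, markPages, mayEmit and
filterOutlinks are hypothetical names, the flag lives in a plain in-memory
map, and the per-page scores are assumed to come from running my query
against the index. Where such a flag would actually live in Nutch is exactly
what I'm asking.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a per-page "may emit outlinks" flag. */
public class OutlinkGate {
    // URL -> whether that page's outlinks may be passed on to the generator.
    private final Map<String, Boolean> emitOutlinks = new HashMap<>();

    /** Post-index pass: flag each page by comparing its query score to a threshold. */
    public void markPages(Map<String, Double> scoreByUrl, double threshold) {
        for (Map.Entry<String, Double> e : scoreByUrl.entrySet()) {
            emitOutlinks.put(e.getKey(), e.getValue() >= threshold);
        }
    }

    /** Pages that were never scored (assumption: not indexed yet) stay allowed. */
    public boolean mayEmit(String sourceUrl) {
        return emitOutlinks.getOrDefault(sourceUrl, true);
    }

    /** Returns the outlinks unchanged for an allowed source page, otherwise none. */
    public List<String> filterOutlinks(String sourceUrl, List<String> outlinks) {
        return mayEmit(sourceUrl) ? new ArrayList<>(outlinks) : new ArrayList<String>();
    }
}
```

The pages themselves stay stored and indexed; only their outlinks are held
back, which is the behaviour I'm after.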

And a more advanced question for later on: the generator has smarts to
prioritize fetching by inlink counts. Is there something I can hack to
"boost" outlink fetches based on the source page's content? For
example, I find a page that scores high on my Lucene query after the
crawl/index cycle is done. I would want the generator to put all of
that page's outlinks up top, even if there aren't many inlinks to
them. Would this be a "generator plugin"?
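
Roughly, the ordering I'm picturing looks like the sketch below. Again this
is not Nutch code: OutlinkBooster, Candidate, priority and order are
hypothetical names, and the formula (log of the inlink count plus a weighted
source-page score) is just an assumption to illustrate the boost, not how the
generator actually ranks URLs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical ordering of fetch candidates with a source-page boost. */
public class OutlinkBooster {
    /** A URL to fetch, its inlink count, and the query score of the page that linked to it. */
    public static final class Candidate {
        final String url;
        final int inlinks;
        final double sourceScore;

        public Candidate(String url, int inlinks, double sourceScore) {
            this.url = url;
            this.inlinks = inlinks;
            this.sourceScore = sourceScore;
        }
    }

    /** Assumed formula: damped inlink count plus a weighted source-page score. */
    public static double priority(Candidate c, double boostWeight) {
        return Math.log1p(c.inlinks) + boostWeight * c.sourceScore;
    }

    /** Returns the candidate URLs sorted from highest to lowest priority. */
    public static List<String> order(List<Candidate> candidates, double boostWeight) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(
                (Candidate c) -> priority(c, boostWeight)).reversed());
        List<String> urls = new ArrayList<>();
        for (Candidate c : sorted) {
            urls.add(c.url);
        }
        return urls;
    }
}
```

With a boost weight of 0 this falls back to plain inlink ordering; a large
enough weight lets outlinks of an on-topic page jump ahead of well-linked
but off-topic ones.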

-Brian

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general