If I understand what you are trying to do, then here is how I would approach it.

Write an HtmlParseFilter that sets an attribute in the ParseData MetaData based on whether the page contains what you are looking for. Then write another MR job that runs after the crawl/index cycle. This job would need to update the CrawlDatum MetaData based on your priority calculation (inlinks and contains text, etc.). Then hack the Generator class around line 160 to change the sort value that it is using based on the CrawlDatum MetaData. I would make using this new sort value an option that you can turn on and off by using different configuration values.

Hope this helps.

Dennis Kubes

Brian Whitman wrote:
> On Feb 17, 2007, at 12:58 PM, Dennis Kubes wrote:
>
>> You can use an HtmlParseFilter and then set a metadata attribute as to
>> whether or not it contains the phrase. Problem with this is that all
>> of the content is still stored. You could also change the
>> ParseOutputFormat to only write out if the word is contained although
>> that is a bit of a hack.
>
> I'm not worried about a hack, our whole set up is very "der lauf der
> dinge" and one more plank won't matter much :) But after sending my
> question out, I realized that I would need to index the document anyway
> before being able to lucene query it for topicality. I don't mind having
> pages stored that don't match my query, but I really would rather the
> generator not get more outlinks from those pages.
>
> So a simple fix would be something I can write or run after a
> crawl/index cycle that can mark certain pages to not emit more URIs in
> the generator. It would query each page in an index and update some
> flag. But what is that flag and how can I get at it?
>
> And more advanced and later on -- the generator has smarts to prioritize
> fetching by inlink counts-- is there something I can hack to "boost"
> outlink fetches based on the source page's content? for example - I
> find a page that scores high on my lucene query after crawl/index gets
> done. I would want the generator to put all of its outlinks up top, even
> if there's not many inlinks to that page... would this be a "generator
> plugin?"
>
> -Brian

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
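
To make the three steps in the reply above a bit more concrete (a flag set at parse time, a priority written after the crawl/index cycle, and a Generator sort value that can be switched on and off), here is a rough stand-alone sketch. It deliberately does not use the real Nutch classes: plain maps stand in for the ParseData and CrawlDatum MetaData, and every name in it (GeneratorSortSketch, topic.contains, topic.priority, the usePriority switch, and the "inlinks plus boost" formula) is made up for illustration. Treat it as one possible shape for the logic, not as how Nutch does it.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Stand-alone illustration only. Plain maps stand in for Nutch's
// ParseData / CrawlDatum metadata; no Nutch API is used here.
final class GeneratorSortSketch {
    // Hypothetical metadata keys (not Nutch names).
    static final String CONTAINS_KEY = "topic.contains";
    static final String PRIORITY_KEY = "topic.priority";

    // Step 1 (what the parse filter would do): record whether the page
    // text contains the phrase, case-insensitively.
    static void markContains(Map<String, String> parseMeta, String pageText, String phrase) {
        boolean hit = pageText.toLowerCase(Locale.ROOT)
                              .contains(phrase.toLowerCase(Locale.ROOT));
        parseMeta.put(CONTAINS_KEY, Boolean.toString(hit));
    }

    // Step 2 (what the post-crawl job would do): store a priority on the
    // datum. The formula "inlinks + boost if the flag is set" is an assumption.
    static void updatePriority(Map<String, String> datumMeta, Map<String, String> parseMeta,
                               int inlinks, float boost) {
        boolean hit = Boolean.parseBoolean(parseMeta.get(CONTAINS_KEY)); // null-safe: false
        float priority = inlinks + (hit ? boost : 0f);
        datumMeta.put(PRIORITY_KEY, Float.toString(priority));
    }

    // Step 3 (what the hacked Generator would do): use the stored priority
    // as the sort value only when the option is on and a value exists.
    static float sortValue(float defaultSort, Map<String, String> datumMeta, boolean usePriority) {
        String stored = datumMeta.get(PRIORITY_KEY);
        if (!usePriority || stored == null) {
            return defaultSort;
        }
        return Float.parseFloat(stored);
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        Map<String, String> datumMeta = new HashMap<>();

        markContains(parseMeta, "A page about Apache Nutch crawling", "nutch");
        updatePriority(datumMeta, parseMeta, 3, 10f);

        System.out.println("option on:  " + sortValue(0.5f, datumMeta, true));
        System.out.println("option off: " + sortValue(0.5f, datumMeta, false));
    }
}
```

Falling back to the default sort value when the option is off or no priority has been stored keeps the stock Generator behaviour intact, which is the point of making the new sort value configurable.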
