I just use the Java built-in regex features, and therefore just supply the string, which I designed for my case using RegexBuddy (a really great tool, by the way).
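As a minimal sketch of that setup (the pattern string and class name here are hypothetical illustrations, not taken from the actual plug-in), the regex is compiled once in a static field so it is not rebuilt on every page evaluation:

```java
import java.util.regex.Pattern;

public class TopicMatcher {
    // Compiled once at class-load time, so the regex is not recompiled
    // on every call. The expression itself is a placeholder; in practice
    // it would be the topic regex designed in RegexBuddy.
    private static final Pattern TOPIC =
            Pattern.compile("solar\\s+energy", Pattern.CASE_INSENSITIVE);

    public static boolean hasWantedContent(String parsedText) {
        return TOPIC.matcher(parsedText).find();
    }
}
```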
Pay attention, though, to static creation, in order to avoid regex compilation at each plug-in load and the run-time hit. I didn't find a way to modify inlinks... on the other hand, you have already gone through the inlinks by the time you are evaluating a given page, so I did not bother. It works fine for me, and I don't see why it shouldn't work for others.

2009/11/28 Eran Zinman <[email protected]>

> Thanks for your help MilleBii!
>
> I will definitely try the square-root option - but is that only valid for
> outlinks, or does it also affect pages linking to the page?
>
> Did you try implementing automatic regex generation? I'm doing focused
> crawling but I'm also thinking about scaling it in the future.
>
> Also, I will be happy to know if anyone else has any other suggestion (or
> an already-implemented strategy) - I think this issue affects most of the
> Nutch community - at least people that use Nutch for focused crawling.
>
> Thanks,
> Eran
>
> On Fri, Nov 27, 2009 at 8:29 PM, MilleBii <[email protected]> wrote:
>
> > Well, what I have created for my own application is a topical-scoring
> > plugin:
> >
> > 1. First I needed to score the pages after parsing, based on my regular
> > expression.
> >
> > 2. Then I looked at several options for how to boost the score of those
> > pages... I only found a way to boost the score of the outlinks of the
> > pages that have the content I wanted. Not perfect, but so be it; there
> > is a high likelihood in my case that adjacent pages also have content
> > which I want.
> >
> > 3. Then, how to boost the score... this took me a while to figure out;
> > I will spare you all the options I tried. The good compromise I found
> > is the following: if the page has content I want and score < 1.0f, then
> > score = squareroot(score)... in this way you are adding weight to the
> > pages which have content you are looking for (since score is usually
> > below 1, squareroot(x) is bigger than x).
> > Of course there are some downsides to that approach: it is more
> > difficult to get the crawler to go outside the sites that have the
> > content you are looking for. It is a bit like digging a hole: until the
> > hole is finished, the crawler will keep exploring it... Experimentally
> > I have found that it works nicely for me, though; if you limit the
> > number of URLs per site, it won't spend its life on them.
> >
> > We could try to generalize this plug-in by putting the regular
> > expression in as a config item, because that is really the only thing
> > which is specific to my application, I believe.
> >
> > 2009/11/27 Eran Zinman <[email protected]>
> >
> > > Hi all,
> > >
> > > I'm trying to figure out ways to improve Nutch focused crawling
> > > efficiency.
> > >
> > > I'm looking for certain pages inside each domain which contain
> > > content I'm looking for.
> > >
> > > I'm unable to know that a certain URL contains what I'm looking for
> > > unless I parse it and do some analysis on it.
> > >
> > > Basically I was thinking about two methods to improve crawling
> > > efficiency:
> > >
> > > 1) Whenever a page is found which contains the data I'm looking for,
> > > improve the overall score for all pages linking to it (and pages
> > > linking to them, and so on...), assuming they have other links that
> > > point to content I'm looking for.
> > > 2) Once I have already found several pages that contain relevant
> > > data - create a regex automatically to match new URLs which might
> > > contain usable content.
> > >
> > > I've started to read about the OPIC-score plugin but was unable to
> > > understand if it can help me or not with issue no. 1.
> > >
> > > Any ideas, guys? I will be very grateful for any help or things that
> > > can point me in the right direction.
> > >
> > > Thanks,
> > > Eran
> >
> > --
> > -MilleBii-
>

--
-MilleBii-
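The square-root boost described in step 3 of the thread above can be sketched as a standalone illustration (class and method names here are hypothetical, not the actual plug-in code): for a score x with 0 < x < 1, sqrt(x) > x, so matching pages are pulled up toward 1.0 while non-matching pages keep their original score.

```java
public class SqrtBoost {
    // If the page matched the topic regex and its score is below 1.0,
    // replace the score with its square root; otherwise leave it alone.
    public static float boost(float score, boolean hasWantedContent) {
        if (hasWantedContent && score < 1.0f) {
            return (float) Math.sqrt(score);
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(boost(0.25f, true));   // 0.5: boosted
        System.out.println(boost(0.25f, false));  // 0.25: unchanged
    }
}
```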
