Oops, correction: I don't see why it shouldn't work for others.

2009/11/28 MilleBii <[email protected]>
> I just use the Java built-in regex features... and therefore just supplied
> the string, which I designed for my case using RegexBuddy, a really great
> tool by the way.
>
> Pay attention though to static creation, in order to avoid regex creation
> at each plug-in load and a run-time hit.
>
> Didn't find a way to modify inlinks... on the other hand, inlinks you have
> gone through already when you are evaluating a given page, so I did not
> bother and it works fine for me, I don't see why it should work for others.
>
>
> 2009/11/28 Eran Zinman <[email protected]>
>
> Thanks for your help MilleBii!
>>
>> I will definitely try the squareroot option - but is that only valid for
>> outlinks, or does it also affect pages linking to the page?
>>
>> Did you try implementing automatic regex generation? I'm doing focused
>> crawling but I'm also thinking about scaling it in the future.
>>
>> Also I will be happy to know if anyone else has any other suggestion (or
>> an already implemented strategy) - I think this issue affects most of the
>> Nutch community - at least people that use Nutch for focused crawling.
>>
>> Thanks,
>> Eran
>>
>> On Fri, Nov 27, 2009 at 8:29 PM, MilleBii <[email protected]> wrote:
>>
>> > Well, what I have created for my own application is a topical-scoring
>> > plugin:
>> >
>> > 1. first I needed to score the pages after parsing, based on my regular
>> > expression
>> >
>> > 2. then I searched several options on how to boost the score of those
>> > pages... I have only found a way to boost the score of the outlinks of
>> > the pages that have content which I wanted. Not perfect, but so be it:
>> > there is a high likelihood in my case that adjacent pages also have
>> > content which I want.
>> >
>> > 3. then how to boost the score... this took me a while to figure out,
>> > I'll spare you all the options I tried. The good compromise I found is
>> > the following:
>> > if the page has content I want and score < 1.0f then score =
>> > squareroot(score)... in this way you are adding weight to the pages
>> > which have content you are looking for (since the score is usually
>> > below 1, squareroot(x) is bigger than x).
>> >
>> > Of course there are some downsides to that approach: it is more
>> > difficult to get the crawler to go outside sites that have content you
>> > are looking for. It is a bit like digging a hole, and until you have
>> > finished the hole it will get the crawler to explore it...
>> > experimentally I have found that it works nicely for me though; if you
>> > limit the number of URLs per site it won't spend its life on them.
>> >
>> > We could try to generalize this plug-in by putting the regular
>> > expression in as a config item, because that is really the only thing
>> > which is specific to my application, I believe.
>> >
>> >
>> >
>> > 2009/11/27 Eran Zinman <[email protected]>
>> >
>> > > Hi all,
>> > >
>> > > I'm trying to figure out ways to improve Nutch focused crawling
>> > > efficiency.
>> > >
>> > > I'm looking for certain pages inside each domain which contain
>> > > content I'm looking for.
>> > >
>> > > I'm unable to know that a certain URL contains what I'm looking for
>> > > unless I parse it and do some analysis on it.
>> > >
>> > > Basically I was thinking about two methods to improve crawling
>> > > efficiency:
>> > >
>> > > 1) Whenever a page is found which contains the data I'm looking for,
>> > > improve the overall score for all pages linking to it (and pages
>> > > linking to them and so on...), assuming they have other links that
>> > > point to content I'm looking for.
>> > > 2) Once I have already found several pages that contain relevant
>> > > data - create a regex automatically to match new URLs which might
>> > > contain usable content.
>> > >
>> > > I've started to read about the OPIC-score plugin but was unable to
>> > > understand if it can help me or not with issue no. 1.
>> > >
>> > > Any ideas, guys? I will be very grateful for any help or things that
>> > > can point me in the right direction.
>> > >
>> > > Thanks,
>> > > Eran
>> > >
>> >
>> >
>> >
>> > --
>> > -MilleBii-
>> >
>
>
> --
> -MilleBii-
>

--
-MilleBii-
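
For anyone reading this thread later, the two points discussed above (compiling the regex once in a static field, and applying the square root only when the score is below 1.0) could be sketched roughly as follows. This is not the actual plugin code and does not use any Nutch API: the class name, method names and the topic regex are all made up for illustration, and the extra "score > 0" guard is an assumption added so that the square root is never taken of a negative value.

```java
import java.util.regex.Pattern;

// Hypothetical sketch only: names and the regex are illustrative assumptions,
// not taken from the plugin discussed in this thread.
final class TopicalBoostSketch {

    // Compiled once when the class is loaded, so the regex is not rebuilt
    // for every page. The expression itself is a placeholder.
    private static final Pattern TOPIC =
            Pattern.compile("focused\\s+crawl(ing|er)?", Pattern.CASE_INSENSITIVE);

    // True if the parsed page text contains the topic expression.
    static boolean isOnTopic(String parsedText) {
        return TOPIC.matcher(parsedText).find();
    }

    // If the page is on topic and 0 < score < 1, return sqrt(score),
    // which is larger than score in that range. Otherwise return the
    // score unchanged. The "score > 0" check is an added assumption.
    static float boost(float score, String parsedText) {
        if (score > 0.0f && score < 1.0f && isOnTopic(parsedText)) {
            return (float) Math.sqrt(score);
        }
        return score;
    }
}
```

In the thread, the boosted value is applied to the outlinks of the matching page; where exactly to hook that in depends on the scoring plugin interface, which this sketch leaves out on purpose.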
