Thanks for your help, MilleBii!

I will definitely try the square-root option - but is it applied only to
outlinks, or does it also affect pages linking to the page?

Did you try implementing automatic regex generation? I'm doing focused
crawling, but I'm also thinking about scaling it in the future.

Also, I'd be happy to hear if anyone else has other suggestions (or
already-implemented strategies) - I think this issue affects most of the Nutch
community, at least the people who use Nutch for focused crawling.

Thanks,
Eran

On Fri, Nov 27, 2009 at 8:29 PM, MilleBii <[email protected]> wrote:

> Well, what I have created for my own application is a topical-scoring plugin:
>
> 1. First I needed to score the pages after parsing, based on my regular
> expression.
>
> 2. Then I looked into several options for how to boost the score of those
> pages... I only found a way to boost the score of the outlinks of the pages
> whose content I wanted. Not perfect, but so be it: in my case there is a
> high likelihood that adjacent pages also have the content I want.
>
> 3. Then how to boost the score... this took me a while to figure out; I'll
> spare you the list of options I tried. The good compromise I found is the
> following:
>   if the page has content I want and score < 1.0f, then score =
> squareroot(score)... this way you add weight to the pages that have the
> content you are looking for (since the score is usually below 1,
> squareroot(x) is bigger than x).
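The square-root boost described in step 3 could be sketched like this (a hypothetical standalone helper for illustration, not the actual plugin code):

```java
// Sketch of the step-3 boosting rule: lift the score of topic-matching
// pages toward 1.0 by replacing it with its square root.
// Hypothetical helper class, not part of Nutch or the plugin discussed.
public class TopicalScoreBooster {

    /**
     * Returns sqrt(score) when the page matched the topic regex and its
     * score is below 1.0, otherwise returns the score unchanged.
     */
    public static float boost(float score, boolean matchesTopic) {
        if (matchesTopic && score < 1.0f) {
            // For 0 < x < 1, sqrt(x) > x, so matching pages are lifted
            // toward 1.0 without ever exceeding it.
            return (float) Math.sqrt(score);
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(boost(0.25f, true));   // 0.5
        System.out.println(boost(0.25f, false));  // 0.25
        System.out.println(boost(1.5f, true));    // 1.5 (already >= 1.0)
    }
}
```

Note that the boost is monotonic, so repeated crawl cycles keep pulling matching pages toward (but never past) 1.0.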
>
> Of course there are some downsides to that approach: it becomes harder to
> get the crawler to go outside the sites that have the content you are
> looking for. It is a bit like digging a hole - until the hole is finished,
> the crawler will keep exploring it... Experimentally I have found that it
> works nicely for me, though: if you limit the number of URLs per site, it
> won't spend its life on them.
>
> We could try to generalize this plugin by putting the regular expression in
> as a config item, because I believe that is really the only thing which is
> specific to my application.
>
>
>
> 2009/11/27 Eran Zinman <[email protected]>
>
> > Hi all,
> >
> > I'm trying to figure out ways to improve Nutch focused crawling
> > efficiency.
> >
> > I'm looking for certain pages inside each domain that contain the content
> > I'm looking for.
> >
> > I'm unable to know whether a certain URL contains what I'm looking for
> > unless I parse it and do some analysis on it.
> >
> > Basically I was thinking about two methods to improve crawling
> efficiency:
> >
> > 1) Whenever a page is found which contains the data I'm looking for,
> > improve the overall score of all pages linking to it (and the pages
> > linking to them, and so on...), assuming they have other links that point
> > to content I'm looking for.
> > 2) Once I have already found several pages that contain relevant data,
> > automatically create a regex to match new URLs which might contain usable
> > content.
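One very simple way idea 2 could work (purely a hypothetical sketch, not existing Nutch code) is to take the longest common prefix of the URLs already known to be relevant and turn it into a regex:

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of idea 2: derive a URL regex from pages already
// known to contain relevant content, using their longest common prefix.
// Illustration only; not part of Nutch or any existing plugin.
public class UrlPatternLearner {

    /** Longest common prefix of the given (non-empty) list of URLs. */
    static String commonPrefix(List<String> urls) {
        String prefix = urls.get(0);
        for (String u : urls) {
            // Shrink the candidate prefix until it matches this URL too.
            while (!u.startsWith(prefix)) {
                prefix = prefix.substring(0, prefix.length() - 1);
            }
        }
        return prefix;
    }

    /** Regex matching any URL under the learned common prefix. */
    public static String learnRegex(List<String> goodUrls) {
        return "^" + Pattern.quote(commonPrefix(goodUrls)) + ".*";
    }

    public static void main(String[] args) {
        List<String> good = List.of(
            "http://example.com/articles/2009/a.html",
            "http://example.com/articles/2009/b.html");
        String regex = learnRegex(good);
        System.out.println(regex);
        // New URLs under the same prefix now match.
        System.out.println(
            "http://example.com/articles/2009/c.html".matches(regex)); // true
    }
}
```

A real implementation would need to be smarter (e.g. generalizing path segments rather than a raw prefix), but even this trivial version shows how matched pages could feed a URL filter for future fetch rounds.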
> >
> > I've started reading about the OPIC-score plugin but was unable to
> > understand whether it can help me with issue no. 1.
> >
> > Any ideas, guys? I will be very grateful for any help or anything that
> > can point me in the right direction.
> >
> > Thanks,
> > Eran
> >
>
>
>
> --
> -MilleBii-
>