I just use Java's built-in regex features... and therefore just supplied
the string, which I designed for my case using RegexBuddy, a really great
tool by the way.

Pay attention, though, to static creation: compile the pattern once so you
avoid recreating the regex on each plug-in invocation and the run-time hit
that goes with it.
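A minimal sketch of that static creation in plain Java; the pattern string here is just a placeholder, not the actual expression from the plugin:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopicMatcher {
    // Compiled once at class-load time, not on every parse call.
    // The expression itself is a placeholder; substitute your own.
    private static final Pattern TOPIC_PATTERN =
            Pattern.compile("\\bexample-topic\\b", Pattern.CASE_INSENSITIVE);

    public static boolean matchesTopic(String pageText) {
        Matcher m = TOPIC_PATTERN.matcher(pageText);
        return m.find();
    }
}
```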

I didn't find a way to modify inlinks... on the other hand, by the time you
are evaluating a given page you have already gone through its inlinks, so I
did not bother. It works fine for me, and I don't see why it shouldn't work
for others.


2009/11/28 Eran Zinman <[email protected]>

> Thanks for your help, MilleBii!
>
> I will definitely try the squareroot option - but is that only valid for
> outlinks, or does it also affect pages linking to the page?
>
> Did you try implementing automatic Regex generation? I'm doing focused
> crawling but I'm also thinking about scaling it in the future.
>
> Also I will be happy to know if anyone else has any other suggestion (or
> already implemented strategy) - I think this issue affects most of the
> Nutch community - at least people that use Nutch for focused crawling.
>
> Thanks,
> Eran
>
> On Fri, Nov 27, 2009 at 8:29 PM, MilleBii <[email protected]> wrote:
>
> > Well, what I have created for my own application is a topical-scoring
> > plugin:
> >
> > 1. first I needed to score the pages after parsing, based on my regular
> > expression
> >
> > 2. then I looked at several options for how to boost the score of those
> > pages... I have only found a way to boost the score of the outlinks of
> > the pages whose content I wanted. Not perfect, but so be it: there is a
> > high likelihood in my case that adjacent pages also have the content I
> > want.
> >
> > 3. then how to boost the score... this took me a while to figure out,
> > and I'll spare you all the options I tried. The good compromise I found
> > is the following:
> >   if the page has content I want and score < 1.0f, then score =
> > squareroot(score)... in this way you are adding weight to the pages that
> > have the content you are looking for (since score is usually below 1,
> > and squareroot(x) is bigger than x for x below 1).
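The boost rule in step 3 can be sketched like this in plain Java (a sketch of the idea, not the actual plugin code):

```java
public class TopicScore {
    // If the page matched the topic regex and its score is below 1.0,
    // take the square root: for 0 < x < 1, sqrt(x) > x, so the score
    // rises toward 1.0 without ever exceeding it. Pages at or above 1.0
    // are left alone.
    public static float boost(float score, boolean hasWantedContent) {
        if (hasWantedContent && score < 1.0f) {
            return (float) Math.sqrt(score);
        }
        return score;
    }
}
```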
> >
> > Of course there are some downsides to that approach: it is more
> > difficult to get the crawler to go outside sites that have the content
> > you are looking for. It is a bit like digging a hole: until the hole is
> > finished, it will keep the crawler exploring it... experimentally I have
> > found that it works nicely for me though; if you limit the number of
> > URLs per site it won't spend its life on them.
> >
> > We could try to generalize this plug-in by putting the regular
> > expression in as a config item, because that is really the only thing
> > which is specific to my application, I believe.
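A sketch of what that generalization could look like. Note this uses java.util.Properties as a stand-in for the real Nutch/Hadoop Configuration object, and the property name "topical.score.regex" is made up for illustration:

```java
import java.util.Properties;
import java.util.regex.Pattern;

public class ConfigurableTopicFilter {
    private final Pattern topicPattern;

    // In a real Nutch plugin this would read from the plugin's
    // Configuration; java.util.Properties stands in here, and the
    // property key "topical.score.regex" is a hypothetical name.
    public ConfigurableTopicFilter(Properties conf) {
        // "(?!)" is an always-failing pattern: if the property is
        // unset, no page is considered topical.
        String expr = conf.getProperty("topical.score.regex", "(?!)");
        topicPattern = Pattern.compile(expr);
    }

    public boolean isWanted(String pageText) {
        return topicPattern.matcher(pageText).find();
    }
}
```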
> >
> >
> >
> > 2009/11/27 Eran Zinman <[email protected]>
> >
> > > Hi all,
> > >
> > > I'm trying to figure out ways to improve Nutch's focused crawling
> > > efficiency.
> > >
> > > I'm looking for certain pages inside each domain which contain the
> > > content I'm looking for.
> > >
> > > I'm unable to know whether a certain URL contains what I'm looking for
> > > unless I parse it and do some analysis on it.
> > >
> > > Basically I was thinking about two methods to improve crawling
> > > efficiency:
> > >
> > > 1) Whenever a page is found which contains the data I'm looking for,
> > > improve the overall score for all pages linking to it (and pages
> > > linking to them, and so on...), assuming they have other links that
> > > point to content I'm looking for.
> > > 2) Once I have already found several pages that contain relevant data,
> > > create a regex automatically to match new URLs which might contain
> > > usable content.
> > >
> > > I've started to read about the OPIC-score plugin but was unable to
> > > understand if it can help me or not with issue no. 1.
> > >
> > > Any ideas, guys? I will be very grateful for any help or anything
> > > that can point me in the right direction.
> > >
> > > Thanks,
> > > Eran
> > >
> >
> >
> >
> > --
> > -MilleBii-
> >
>



-- 
-MilleBii-
