On Nov 28, 2009, at 12:42am, Eran Zinman wrote:

Thanks for your help, MilleBii!

I will definitely try the squareroot option - but does that apply only to
outlinks, or does it also affect pages linking to the page?

Did you try implementing automatic Regex generation? I'm doing focused
crawling but I'm also thinking about scaling it in the future.

Also, I would be happy to hear if anyone else has any other suggestions (or
an already-implemented strategy) - I think this issue affects most of the
Nutch community, at least the people who use Nutch for focused crawling.

I believe I wrote several emails to this list a while ago about how we handled this for Krugle, so those could prove useful.

The basic concept was:

- Use machine learning to train a classifier (we used term vector similarity and max entropy) for good & bad pages.

- Run this on each fetched page, and use the results to adjust the outbound OPIC link scores.

We didn't bother following backlinks. I don't think this would impact your crawl frontier generation much.
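
To make those two steps concrete, here is a minimal standalone sketch. The
PageClassifier interface and the wiring are hypothetical placeholders, not
the actual Krugle code; in a Nutch plugin this logic would live in the
scoring filter's outlink-scoring hook:

import java.util.Map;

// Hypothetical classifier contract: 1.0 = clearly on-topic, 0.0 = off-topic.
interface PageClassifier {
    double relevance(String url, String parsedText);
}

public class ClassifierScoring {

    // Scale each outlink's score contribution by the source page's
    // relevance, so links from good pages bubble up in the fetch queue.
    static void adjustOutlinkScores(PageClassifier classifier,
                                    String fromUrl, String parsedText,
                                    Map<String, Float> outlinkScores) {
        float relevance = (float) classifier.relevance(fromUrl, parsedText);
        for (Map.Entry<String, Float> e : outlinkScores.entrySet()) {
            e.setValue(e.getValue() * relevance);
        }
    }
}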

But a different approach would be to stop the crawl after some significant
number of pages had been fetched, then use a modified version of the
PageRank calculation that factors in the classification score. This could
give you better static scores, and you could also use this new score to
recalculate scores for unfetched pages, based on the source pages' static
scores.
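
One hypothetical reading of "factors in the classification score" is to use
the per-page score as the teleport prior of a personalized PageRank, so the
static scores concentrate around on-topic pages. A minimal power-iteration
sketch (dangling-node handling omitted for brevity):

import java.util.Arrays;
import java.util.List;

public class TopicalPageRank {

    // inlinks.get(i) = indices of pages linking to page i
    // outDeg[j]      = number of outlinks of page j
    // topical[i]     = classifier score of page i (0.0 if unfetched)
    static double[] rank(List<int[]> inlinks, int[] outDeg, double[] topical,
                         double damping, int iterations) {
        int n = topical.length;
        double topicalSum = Arrays.stream(topical).sum();

        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double fromLinks = 0.0;
                for (int j : inlinks.get(i)) {
                    fromLinks += rank[j] / outDeg[j];
                }
                // Teleport mass is proportional to the classification score,
                // so unfetched pages (topical == 0) only inherit via links.
                double teleport =
                        topicalSum > 0 ? topical[i] / topicalSum : 1.0 / n;
                next[i] = (1.0 - damping) * teleport + damping * fromLinks;
            }
            rank = next;
        }
        return rank;
    }
}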

-- Ken



Thanks,
Eran

On Fri, Nov 27, 2009 at 8:29 PM, MilleBii <[email protected]> wrote:

Well, what I have created for my own application is a topical-scoring plugin:

1. first I needed to score the pages after parsing, based on my regular
expression

2. then I looked at several options for how to boost the score of those
pages... I only found a way to boost the score of the outlinks of the pages
that have the content I wanted. Not perfect, but so be it - in my case
there is a high likelihood that adjacent pages also have the content I
want.

3. then how to boost the score... this took me a while to figure out; I'll
spare you all the options I tried. The good compromise I found is the
following: if the page has content I want and score < 1.0f, then
score = squareroot(score)... this way you add weight to the pages that have
the content you are looking for (since the score is usually below 1,
squareroot(x) is bigger than x).
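
As a rough sketch of steps 1 and 3 together (the regex and the method names
are placeholders; in the actual plugin this would run inside Nutch's
scoring-filter hooks):

import java.util.regex.Pattern;

public class TopicalBoost {

    // Step 1: check the parsed page content against the topical regex.
    static boolean matchesTopic(Pattern topicRegex, String parsedText) {
        return topicRegex.matcher(parsedText).find();
    }

    // Step 3: for matching pages, sqrt() lifts scores below 1.0 toward 1.0
    // (0.25 -> 0.5, 0.81 -> 0.9) and leaves scores >= 1.0 unchanged.
    static float boost(float score, boolean matchesTopic) {
        return (matchesTopic && score < 1.0f) ? (float) Math.sqrt(score) : score;
    }
}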

Of course there are some downsides to that approach: it is more difficult
to get the crawler to go outside the sites that have the content you are
looking for. It is a bit like digging a hole - until the hole is finished,
the crawler will keep exploring it... Experimentally I have found that it
works nicely for me though; if you limit the number of URLs per site, it
won't spend its life on them.

We could try to generalize this plug-in by making the regular expression a
config item, because I believe that is really the only thing specific to my
application.



2009/11/27 Eran Zinman <[email protected]>

Hi all,

I'm trying to figure out ways to improve Nutch's focused-crawling efficiency.

Within each domain, I'm looking for certain pages that contain the content
I'm interested in.

I'm unable to know whether a certain URL contains what I'm looking for
unless I parse it and do some analysis on it.

Basically I was thinking about two methods to improve crawling efficiency:

1) Whenever a page is found that contains the data I'm looking for, improve
the overall score of all pages linking to it (and of pages linking to them,
and so on...), assuming they have other links pointing to content I'm
looking for.
2) Once I have already found several pages that contain relevant data,
automatically create a regex to match new URLs that might contain usable
content.
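
For what it's worth, here is a naive sketch of idea no. 2, assuming the
variable part of good URLs is mostly numeric IDs; the class and the
digit-run generalization are illustrative assumptions, not a proven
strategy:

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class UrlPatternLearner {

    // Turn one known-good URL into a candidate pattern: keep its literal
    // structure but let any digit run vary.
    static String generalize(String url) {
        String quoted = Pattern.quote(url);  // \Q...\E literal form
        return quoted.replaceAll("[0-9]+", "\\\\E[0-9]+\\\\Q");
    }

    // One alternation over the distinct patterns seen so far.
    static Pattern learn(Iterable<String> goodUrls) {
        Set<String> patterns = new LinkedHashSet<String>();
        for (String url : goodUrls) {
            patterns.add(generalize(url));
        }
        return Pattern.compile(String.join("|", patterns));
    }

    public static void main(String[] args) {
        Pattern p = learn(Arrays.asList(
                "http://example.com/article/123",
                "http://example.com/article/456"));
        // Both samples collapse to one pattern, which now matches new IDs:
        System.out.println(p.matcher("http://example.com/article/789").matches());
    }
}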

I've started reading about the OPIC scoring plugin but was unable to figure
out whether it can help me with method no. 1.

Any ideas, guys? I would be very grateful for any help, or for anything
that can point me in the right direction.

Thanks,
Eran




--
-MilleBii-


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



