Re: Efficient focused crawling

MilleBii Sat, 28 Nov 2009 03:48:10 -0800

Oh!

1. I have not worked on that, but if you find something I'm interested.


2. The inlinks, by definition, point to the page you are considering, so I
don't understand what you mean. Boosting those inlinks actually means giving
more weight to the page, which then gets distributed to its outlinks.
But probably what you meant is that you would like to boost the outlinks of
the pages pointing to the page being evaluated, a kind of backtracking if you
will. Quite often pages point to each other, either in a circle or via some
kind of site menu, so you get that effect in a way, and the root page of a
site accumulates the weight of the referred pages.
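
To make the backtracking idea concrete, here is a minimal sketch in plain
Java. It is not Nutch code: the class BacktrackBoost, the method
siblingsToBoost and the two in-memory link maps are hypothetical names
assumed purely for illustration. Given a page that turned out to be relevant,
it collects the other outlinks of every page pointing to it, i.e. the URLs one
might want to fetch with higher priority.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names are Nutch APIs.
class BacktrackBoost {

    /**
     * Returns the "sibling" URLs of a relevant page: every outlink of every
     * page that links to it, excluding the relevant page itself.
     *
     * @param relevant URL of the page found to contain wanted content
     * @param inlinks  assumed map: URL -> URLs of pages linking to it
     * @param outlinks assumed map: URL -> URLs that page links to
     */
    static Set<String> siblingsToBoost(String relevant,
                                       Map<String, Set<String>> inlinks,
                                       Map<String, Set<String>> outlinks) {
        Set<String> result = new LinkedHashSet<>();
        Set<String> parents = inlinks.getOrDefault(relevant, Collections.emptySet());
        for (String parent : parents) {
            Set<String> children = outlinks.getOrDefault(parent, Collections.emptySet());
            for (String sibling : children) {
                if (!sibling.equals(relevant)) {
                    result.add(sibling); // candidate for a higher fetch priority
                }
            }
        }
        return result;
    }
}
```

This only goes one step back; following the "and pages linking to them and so
on" idea would mean repeating the lookup on each parent, with some depth limit
so that menu-style link circles do not boost a whole site.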


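For reference, the square-root boost and the static regex creation discussed
further down in the quoted thread could look roughly like the sketch below.
This is an assumption-laden illustration, not the actual plugin: the class
name TopicalScore, the placeholder pattern and the method names are all made
up, and nothing here uses the Nutch scoring API. The guard against negative
scores is an addition so that Math.sqrt never returns NaN.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the real topical-scoring plugin.
class TopicalScore {

    // Compiled once when the class is loaded, not once per page.
    // The expression is a placeholder for an application-specific one.
    private static final Pattern TOPIC =
            Pattern.compile("focused\\s+crawl(ing|er)?", Pattern.CASE_INSENSITIVE);

    /** True if the parsed page text matches the topic pattern. */
    static boolean isOnTopic(String text) {
        return TOPIC.matcher(text).find();
    }

    /**
     * Square-root boost: for an on-topic page with 0 <= score < 1,
     * sqrt(score) >= score, so the page gains weight. Other scores
     * are returned unchanged.
     */
    static float boost(float score, String text) {
        if (isOnTopic(text) && score >= 0.0f && score < 1.0f) {
            return (float) Math.sqrt(score);
        }
        return score;
    }
}
```

With these assumptions, an on-topic page with score 0.25 would be raised to
0.5, while an off-topic page or a page with score 1.0 or above keeps its
score.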



2009/11/28 Eran Zinman <[email protected]>

> Hi MilleBii,
>
> I think you misinterpreted what I meant.
>
> 1. Regarding Regex - I know I can build a Regex beforehand to identify URLs,
> but I would have to create one manually for each domain I'm crawling, which
> is not scalable. I'm looking for a way to build the Regex automatically using
> machine learning. I can tell whether a certain page contains the content
> I'm looking for only after I parse it. I want my crawler to create Regex
> patterns automatically based on its crawling experience.
>
> 2. I want to boost inlinks not necessarily to crawl them again, but to crawl
> at higher priority the other links they link to, on the assumption that
> these links might contain the content I'm looking for.
>
> Thanks for your help!
>
> Eran
>
>
>
> On Sat, Nov 28, 2009 at 10:56 AM, MilleBii <[email protected]> wrote:
>
> > oops : why it shouldn't work for others.
> >
> > 2009/11/28 MilleBii <[email protected]>
> >
> > > I just use the Java built-in regex features... and therefore just supplied
> > > the string, which I designed for my case using RegexBuddy, a really great
> > > tool by the way.
> > >
> > > Pay attention, though, to static creation in order to avoid regex creation
> > > at each plug-in load and the run-time hit.
> > >
> > > Didn't find a way to modify inlinks... on the other hand, inlinks you have
> > > gone through already when you are evaluating a given page, so I did not
> > > bother and it works fine for me. I don't see why it should work for others.
> > >
> > >
> > > 2009/11/28 Eran Zinman <[email protected]>
> > >
> > >> Thanks for your help MilleBii!
> > >>
> > >> I will definitely try the squareroot option - but is that only valid for
> > >> outlinks, or does it also affect pages linking to the page?
> > >>
> > >> Did you try implementing automatic Regex generation? I'm doing focused
> > >> crawling but I'm also thinking about scaling it in the future.
> > >>
> > >> Also I will be happy to know if anyone else has any other suggestions (or
> > >> an already implemented strategy) - I think this issue affects most of the
> > >> Nutch community - at least people that use Nutch for focused crawling.
> > >>
> > >> Thanks,
> > >> Eran
> > >>
> > >> On Fri, Nov 27, 2009 at 8:29 PM, MilleBii <[email protected]> wrote:
> > >>
> > >> > Well, what I have created for my own application is a topical-scoring
> > >> > plugin:
> > >> >
> > >> > 1. First I needed to score the pages after parsing, based on my regular
> > >> > expression.
> > >> >
> > >> > 2. Then I searched several options on how to boost the score of those
> > >> > pages... I have only found a way to boost the score of the outlinks of
> > >> > the pages that have the content I wanted. Not perfect, but so be it:
> > >> > there is a high likelihood in my case that adjacent pages also have
> > >> > content I want.
> > >> >
> > >> > 3. Then how to boost the score... this took me a while to figure out,
> > >> > I'll spare you all the options I tried. The good compromise I found is
> > >> > the following:
> > >> >   if the page has content I want and score < 1.0f then score =
> > >> > squareroot(score)... in this way you are adding weight to the pages
> > >> > which have the content you are looking for (since score is usually
> > >> > below 1, squareroot(x) is bigger than x).
> > >> >
> > >> > Of course there are some downsides to that approach: it is more
> > >> > difficult to get the crawler to go outside sites that have the content
> > >> > you are looking for. It is a bit like digging a hole: until you have
> > >> > finished the hole, the crawler will keep exploring it... experimentally
> > >> > I have found that it works nicely for me though; if you limit the
> > >> > number of URLs per site it won't spend its life on them.
> > >> >
> > >> > We could try to generalize this plug-in by making the regular expression
> > >> > a config item, because that is really the only thing which is specific
> > >> > to my application, I believe.
> > >> >
> > >> >
> > >> >
> > >> > 2009/11/27 Eran Zinman <[email protected]>
> > >> >
> > >> > > Hi all,
> > >> > >
> > >> > > I'm trying to figure out ways to improve Nutch focused crawling
> > >> > > efficiency.
> > >> > >
> > >> > > I'm looking for certain pages inside each domain which contain the
> > >> > > content I'm looking for.
> > >> > >
> > >> > > I'm unable to know that a certain URL contains what I'm looking for
> > >> > > unless I parse it and do some analysis on it.
> > >> > >
> > >> > > Basically I was thinking about two methods to improve crawling
> > >> > > efficiency:
> > >> > >
> > >> > > 1) Whenever a page is found which contains the data I'm looking for,
> > >> > > improve the overall score for all pages linking to it (and pages
> > >> > > linking to them and so on...), assuming they have other links that
> > >> > > point to content I'm looking for.
> > >> > > 2) Once I have already found several pages that contain relevant data,
> > >> > > create a Regex automatically to match new URLs which might contain
> > >> > > usable content.
> > >> > >
> > >> > > I've started to read about the OPIC-score plugin but was unable to
> > >> > > understand if it can help me or not with issue no. 1.
> > >> > >
> > >> > > Any ideas, guys? I will be very grateful for any help or things that
> > >> > > can point me in the right direction.
> > >> > >
> > >> > > Thanks,
> > >> > > Eran
> > >> > >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > -MilleBii-
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > -MilleBii-
> > >
> >
> >
> >
> > --
> > -MilleBii-
> >
>



-- 
-MilleBii-
