Oops, I meant: I don't see why it shouldn't work for others.

2009/11/28 MilleBii <[email protected]>

> I just use Java's built-in regex features, so I simply supplied the string,
> which I designed for my case using RegexBuddy, a really great tool by the
> way.
>
> Pay attention to static creation, though: compile the regex once rather
> than at each plug-in load, to avoid a run-time hit.
>
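A minimal sketch of the static-creation advice above, assuming the plug-in only needs a yes/no match on the parsed text. The class name `TopicMatcher` and the pattern itself are hypothetical placeholders, not taken from the actual plug-in.

```java
import java.util.regex.Pattern;

final class TopicMatcher {
    // Compiled once when the class is loaded, not on every call.
    // Hypothetical pattern: replace with the expression designed for your topic.
    private static final Pattern TOPIC =
            Pattern.compile("(?i)\\b(nutch|crawler)\\b");

    private TopicMatcher() {
    }

    /** Returns true if the parsed page text contains the topic pattern. */
    static boolean isOnTopic(String text) {
        return text != null && TOPIC.matcher(text).find();
    }
}
```

A compiled `Pattern` is immutable and safe to share between threads; only the `Matcher` created per call holds state.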
> I didn't find a way to modify inlinks. On the other hand, you have already
> gone through the inlinks by the time you evaluate a given page, so I did not
> bother. It works fine for me, and I don't see why it should work for others.
>
>
> 2009/11/28 Eran Zinman <[email protected]>
>
>> Thanks for your help, MilleBii!
>>
>> I will definitely try the square-root option, but does it apply only to
>> outlinks, or does it also affect pages linking to the page?
>>
>> Did you try implementing automatic Regex generation? I'm doing focused
>> crawling but I'm also thinking about scaling it in the future.
>>
>> Also, I would be happy to hear if anyone else has other suggestions (or an
>> already implemented strategy). I think this issue affects most of the Nutch
>> community, or at least the people who use Nutch for focused crawling.
>>
>> Thanks,
>> Eran
>>
>> On Fri, Nov 27, 2009 at 8:29 PM, MilleBii <[email protected]> wrote:
>>
>> > Well, what I have created for my own application is a topical-scoring
>> > plugin:
>> >
>> > 1. First I needed to score the pages after parsing, based on my regular
>> > expression.
>> >
>> > 2. Then I looked into several options for boosting the score of those
>> > pages. I only found a way to boost the score of the outlinks of the pages
>> > that have the content I want. Not perfect, but so be it: in my case there
>> > is a high likelihood that adjacent pages also have content I want.
>> >
>> > 3. Then how to boost the score... this took me a while to figure out, so I
>> > will spare you all the options I tried. The good compromise I found is the
>> > following:
>> >   if the page has content I want and score < 1.0f, then
>> >   score = squareroot(score)
>> > This adds weight to the pages that have the content you are looking for
>> > (the score is usually below 1, and squareroot(x) is bigger than x for
>> > 0 < x < 1).
>> >
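The square-root rule in step 3 can be sketched as follows. This is only an illustration of the arithmetic, not the plug-in's actual code: the class and method names are hypothetical, and the guard against non-positive scores is an added assumption so that `Math.sqrt` never returns NaN.

```java
final class TopicalBoost {
    private TopicalBoost() {
    }

    /**
     * Returns the boosted score for an on-topic page.
     * Scores in (0, 1) are replaced by their square root, which is larger;
     * everything else is returned unchanged.
     */
    static float boost(float score, boolean onTopic) {
        // Assumption: scores <= 0 are left alone (not stated in the thread).
        if (onTopic && score > 0.0f && score < 1.0f) {
            return (float) Math.sqrt(score);
        }
        return score;
    }
}
```

For example, `TopicalBoost.boost(0.25f, true)` returns `0.5f`, while an off-topic page with the same score keeps `0.25f`.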
>> > Of course there are some downsides to that approach: it is more difficult
>> > to get the crawler to go outside sites that have the content you are
>> > looking for. It is a bit like digging a hole: until you have finished the
>> > hole, the crawler keeps exploring it. Experimentally, though, I have found
>> > that it works nicely for me; if you limit the number of URLs per site, it
>> > won't spend its life on them.
>> >
>> > We could try to generalize this plug-in by making the regular expression
>> > a config item, because I believe that is really the only thing that is
>> > specific to my application.
>> >
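One possible shape for the "regex as a config item" idea is sketched below. To keep it self-contained it reads from `java.util.Properties` as a stand-in for the Nutch/Hadoop configuration object, and the key `topical.scoring.regex` is a made-up name, not an existing Nutch property.

```java
import java.util.Properties;
import java.util.regex.Pattern;

final class ConfiguredTopicMatcher {
    // Hypothetical property name, chosen for illustration only.
    static final String REGEX_KEY = "topical.scoring.regex";

    private final Pattern pattern;

    /** Compiles the pattern once, when the configuration is handed over. */
    ConfiguredTopicMatcher(Properties conf) {
        String regex = conf.getProperty(REGEX_KEY);
        if (regex == null || regex.isEmpty()) {
            throw new IllegalArgumentException("Missing property: " + REGEX_KEY);
        }
        this.pattern = Pattern.compile(regex);
    }

    /** Returns true if the parsed page text contains the configured pattern. */
    boolean isOnTopic(String text) {
        return text != null && pattern.matcher(text).find();
    }
}
```

Compiling in the constructor keeps the cost at one compilation per configuration rather than one per page.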
>> >
>> >
>> > 2009/11/27 Eran Zinman <[email protected]>
>> >
>> > > Hi all,
>> > >
>> > > I'm trying to figure out ways to improve Nutch's focused crawling
>> > > efficiency.
>> > >
>> > > I'm looking for certain pages inside each domain that contain the
>> > > content I'm looking for.
>> > >
>> > > I can't know whether a certain URL contains what I'm looking for
>> > > unless I parse it and do some analysis on it.
>> > >
>> > > Basically, I was thinking about two methods to improve crawling
>> > > efficiency:
>> > >
>> > > 1) Whenever a page is found that contains the data I'm looking for,
>> > > improve the overall score of all pages linking to it (and pages linking
>> > > to them, and so on), assuming they have other links that point to
>> > > content I'm looking for.
>> > > 2) Once I have found several pages that contain relevant data, create a
>> > > regex automatically to match new URLs that might contain usable content.
>> > >
>> > > I've started to read about the OPIC-score plugin but was unable to
>> > > understand if it can help me or not with issue no. 1.
>> > >
>> > > Any ideas, guys? I would be very grateful for any help or pointers in
>> > > the right direction.
>> > >
>> > > Thanks,
>> > > Eran
>> > >
>> >
>> >
>> >
>> > --
>> > -MilleBii-
>> >
>>
>
>
>
> --
> -MilleBii-
>



-- 
-MilleBii-
