Weird, I didn't see my own mail arriving on the list. I sent it via KMail but 
am on webmail now, which seems to work. Anyway, for vertical search on a whole 
website I would rely on your (customized) Lucene similarity and proper 
analysis, but also on downgrading `bad` pages, for which you can write custom 
classifier plugins in Nutch. That way you can, for example, get rid of hub 
pages and promote actual content.
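
For what it's worth, a rough sketch of such a plugin, assuming the Nutch 1.x 
IndexingFilter interface, could look like the code below. The class name and 
the link_density field are just made up for the example; it records a simple 
link-density signal at index time, which you could then feed into a boost 
function on the search side to push hub pages down.

// Hypothetical example, not an existing Nutch plugin: an IndexingFilter that
// flags hub-like pages (many outlinks, little text) via an extra field.
package org.example.nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class HubPageIndexingFilter implements IndexingFilter {

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    int outlinks = parse.getData().getOutlinks().length;
    int textLength = parse.getText() == null ? 0 : parse.getText().length();

    // Crude heuristic: many outlinks relative to visible text usually means a
    // hub/navigation page rather than actual content. Tune to your own data.
    float linkDensity = textLength == 0 ? outlinks
        : (float) outlinks / textLength;
    doc.add("link_density", String.valueOf(linkDensity));

    return doc;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

Register the plugin in plugin.includes and you can then downgrade documents 
with a high link_density at query time, or train a proper classifier and 
write its output to the same field instead.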


Anyway, it all depends on what you want to achieve, which is....? :)

-----Original message-----
> From: Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
> Sent: Wednesday 10th September 2014 20:09
> To: Markus Jelsma <mar...@openindex.io>
> Cc: user@nutch.apache.org
> Subject: Re: Revisiting Loops Job in Nutch Trunk
> 
> Hi Markus,
> Yeah +1 on this one. I was aware of that. The documentation is clear about
> the fact that LinkRank is suited to horizontal crawl scenarios.
> That is making me think about an alternative that is better suited to
> domain-specific vertical scenarios.
> 
> On Wed, Sep 10, 2014 at 7:49 AM, Markus Jelsma <mar...@openindex.io> wrote:
> 
> > Hi - I would not use LinkRank on small-scale crawls, nor for verticals:
> > if internal links are ignored, there are few links to score; if not, the
> > graph is too dense.
> >
> > It is only useful - for me/us - to let the web decide which hosts and pages
> > are popular, so that means large scale.
> >
> > On Wednesday 10 September 2014 07:43:34 Lewis John Mcgibbney wrote:
> > > Hi Markus,
> > >
> > > On Wed, Sep 10, 2014 at 2:00 AM, <user-digest-h...@nutch.apache.org> wrote:
> > > > Hey Lewis,
> > > >
> > > > We didn't use it in the end, but did run the LinkRank on large amounts
> > > > of data. We then used the scores generated by it for biasing a
> > > > deduplication algorithm. We tested it thoroughly and never stumbled on
> > > > issues that could have been resolved using the Loops algorithm.
> > > >
> > > Thanks for the reply, Markus.
> > >
> > > OK, so here is the deal: we are currently running exhaustive vertical
> > > crawls on around 20-30 domains. We are not obtaining external links at
> > > the moment to domains outside of those target domains, so I've adjusted
> > > the <linkrank> properties in nutch-site.xml accordingly, along with other
> > > related properties and config, to restrict the crawl as such.
> > > I am going to experiment with using both options in an attempt to move
> > > towards attacking this documentation and substantiating my own
> > > understanding.
> > > Thanks for your reply.
> > > Lewis
> >
> >
> 
> 
> -- 
> *Lewis*
> 
