RE: Document scores(boost)

Markus Jelsma Thu, 10 Sep 2015 11:40:17 -0700

Hello, if you really want offline scores calculated, then ideally you should 
run those jobs in every cycle, after updating the DB and before indexing, 
because each cycle probably brings in new data. However, you can also run them 
asynchronously by periodically dumping the scores to a flat file (NodeDumper can 
do that). Solr can then read that file as an External File Field.
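
As a rough illustration of the asynchronous route, here is a minimal sketch 
that rewrites a score dump into the key=value lines that Solr's 
ExternalFileField reads. It assumes each dump line is "url<TAB>score", which 
may not match what NodeDumper writes in your version, so check the actual 
output first. The function name to_external_file_field is hypothetical, not 
part of Nutch or Solr.

```python
import io


def to_external_file_field(dump_lines, out):
    """Write 'key=value' lines for Solr's ExternalFileField.

    Assumption: each input line is 'url<TAB>score'. Adjust the parsing
    if your dump uses a different layout.
    Returns the number of lines written.
    """
    written = 0
    for raw in dump_lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so the score is always the final column.
        url, sep, score = line.rpartition("\t")
        if not sep:
            continue  # no tab found: not a url/score pair
        try:
            value = float(score)
        except ValueError:
            continue  # score column is not numeric
        out.write(f"{url}={value}\n")
        written += 1
    return written


if __name__ == "__main__":
    sample = [
        "http://example.com/\t1.25\n",
        "broken line\n",
        "http://example.org/a\t0.5\n",
    ]
    buf = io.StringIO()
    to_external_file_field(sample, buf)
    print(buf.getvalue(), end="")
```

Solr looks for the result in a file named external_<fieldname> in the index 
data directory, and the keys must match the values of the key field configured 
for that External File Field.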


But again, only if you really need it. By default the webgraph ignores internal 
links, for good reasons: the graph would otherwise become too dense, and internal 
scores are not very useful. In almost all cases you don't need it; it only pays 
off if you are going to crawl very large portions of the web. In most cases, 
TF*IDF or BM25 scoring in Solr/Lucene is superior.

Markus
 
 
-----Original message-----
> From:Imtiaz Shakil Siddique <shakilsust...@gmail.com>
> Sent: Thursday 10th September 2015 19:11
> To: user@nutch.apache.org
> Subject: RE: Document scores(boost)
> 
> Hello Markus Jelsma,
> 
> Thank you for the advice. But this score calculation is done after the data
> is indexed to solr. So when the scores are updated inside the crawldb Solr
> won't get it.
> 
> I think a workaround for this problem would be moving the Solr index
> phase to the end of all the operations.
> But one thing I'm not clear on is how often I should run these webgraph
> update commands.
> 
> Thank you,
> Imtiaz Shakil Siddique
> On Sep 10, 2015 8:36 PM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:
> 
> > Yes, removing OPIC from the config will simply disable it.
> >
> > The webgraph program will create a webgraph datastructure for the
> > specified segments. The linkrank program will then calculate the scores for
> > each node. Finally, the scoreupdater writes the score from the webgraph
> > back into the crawldb. This program is very intensive. Use it only if you
> > really need it.
> >
> > Markus
> >
> > -----Original message-----
> > > From:Imtiaz Shakil Siddique <shakilsust...@gmail.com>
> > > Sent: Thursday 10th September 2015 16:04
> > > To: user@nutch.apache.org
> > > Subject: Re: Document scores(boost)
> > >
> > > Hello Markus Jelsma,
> > >
> > > So you are suggesting that I should
> > > 1. remove "scoring-opic" plugin
> > > 2. run the webgraph > linkrank > scoreupdater from /bin/crawl script
> > > if I want to calculate document boost with all segments in hand.
> > >
> > >
> > > It'd be very helpful if you could explain what these four things do
> > > (webgraph, linkrank, scoreupdater, nodedumper)
> > >
> > > Thank you so much for the help.
> > > Imtiaz Shakil Siddique
> > >
> > >
> > > On 10 September 2015 at 19:27, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > >
> > > > Hello - OPIC is useless in incremental crawls. You can either disable
> > > > scoring altogether, or use webgraph > linkrank > scoreupdater.
> > > > Markus
> > > >
> > > > -----Original message-----
> > > > > From:Imtiaz Shakil Siddique <shakilsust...@gmail.com>
> > > > > Sent: Wednesday 9th September 2015 23:09
> > > > > To: user@nutch.apache.org
> > > > > Subject: Document scores(boost)
> > > > >
> > > > > Hello,
> > > > > I've been using Nutch 1.9/1.10 for about six months. One thing I noticed
> > > > > is that at each iteration (during the parsing phase) Nutch calculates a
> > > > > document boost (using the OPIC algorithm).
> > > > >
> > > > > 1. My question is how this score is adjusted with respect to all the
> > > > > segments.
> > > > >
> > > > > 2. Another question: inside the bin/crawl script, what do webgraph,
> > > > > linkrank, scoreupdater, and nodedumper do? Can anyone be kind enough to
> > > > > explain?
> > > > >
> > > > > Thank you so much.
> > > > > Imtiaz Shakil Siddique
> > > > >
> > > >
> > >
> >
>
