Hi - no, not scoring-link. Check out the LinkRank pages on the wiki and the 
webgraph job that Nutch has. It builds a webgraph, then performs LinkRank and 
then writes the scores back to the crawldb. It is a slow process, but very 
powerful. We don't use it for document boosting but to determine the 
top-ranking hosts on a large scale. 
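
To give a rough idea of why link-based ranking is batch oriented, here is a 
minimal sketch of a damped power iteration over a tiny host graph. It is only 
an illustration of the general technique and not Nutch's actual LinkRank 
code. The class name LinkRankSketch, the example host names, the damping 
factor 0.85 and the iteration count are all assumptions made for this example. 

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LinkRankSketch {

    /**
     * Damped power iteration over an outlink map.
     * Illustrative only; this is NOT Nutch's LinkRank implementation.
     */
    public static Map<String, Double> rank(Map<String, List<String>> outlinks,
                                           int iterations, double damping) {
        // Start every node with the same score.
        Map<String, Double> scores = new HashMap<>();
        for (String node : outlinks.keySet()) {
            scores.put(node, 1.0);
        }
        for (int i = 0; i < iterations; i++) {
            // Every node gets a base score, independent of its inlinks.
            Map<String, Double> next = new HashMap<>();
            for (String node : outlinks.keySet()) {
                next.put(node, 1.0 - damping);
            }
            // Each node passes a damped, equal share of its score to its targets.
            for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    continue; // dangling node: passes nothing on in this sketch
                }
                double share = damping * scores.get(e.getKey()) / targets.size();
                for (String target : targets) {
                    // Links to nodes outside the graph are ignored.
                    next.computeIfPresent(target, (k, v) -> v + share);
                }
            }
            scores = next;
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical three-host graph: two hosts link to a hub, the hub links back to one.
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("a.example", List.of("hub.example"));
        graph.put("b.example", List.of("hub.example"));
        graph.put("hub.example", List.of("a.example"));

        Map<String, Double> scores = rank(graph, 100, 0.85);
        scores.forEach((host, score) -> System.out.println(host + " " + score));
    }
}
```

Every pass needs the complete graph and the scores from the previous pass, 
which is why this kind of computation runs as a separate batch job rather 
than per document at fetch or index time. 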
 
-----Original message-----
> From:Eyeris RodrIguez Rueda <eru...@uci.cu>
> Sent: Wednesday 20th May 2015 23:28
> To: user@nutch.apache.org
> Subject: Re: [MASSMAIL]Re: about boost field extremely high
> 
> Thanks to all for your quick replies.
> 
> Is there any description of how scoring-link works? I was reading the 
> source code but don't understand it at all. 
> 
> Markus, are you suggesting that I use the scoring-link plugin? Is this 
> Nutch's LinkRank or not?
> 
> I really appreciate your help.
> 
> 
> 
> 
> ----- Original message -----
> From: "Markus Jelsma" <markus.jel...@openindex.io>
> To: user@nutch.apache.org
> Sent: Wednesday, 20 May 2015 16:53:26
> Subject: RE: [MASSMAIL]Re: about boost field extremely high
> 
> Yes indeed. But it also makes sense to rely on Lucene's scoring algorithms 
> and custom boosting functions. The problem with generic document boosts is 
> that they can negatively influence your result sets, putting non-relevant 
> but highly scored documents on top. Another alternative is to use Nutch's 
> LinkRank. It is batch oriented but much more powerful. 
>  
> -----Original message-----
> > From:Julien Nioche <lists.digitalpeb...@gmail.com>
> > Sent: Wednesday 20th May 2015 22:10
> > To: user@nutch.apache.org
> > Subject: Re: [MASSMAIL]Re: about boost field extremely high
> > 
> > See https://issues.apache.org/jira/browse/NUTCH-1958 and the reference to a
> > related discussion. The choice of scoring depends on the nature of your
> > crawl. You can also use no scoring filter at all, in which case all the
> > docs will get a boost of 1.
> > 
> > 
> > On 20 May 2015 at 20:55, Eyeris RodrIguez Rueda <eru...@uci.cu> wrote:
> > 
> > > Yes Julien.
> > > I'm using only scoring-opic. This is my plugin.includes property.
> > > I have attached my nutch-site.xml.
> > > Is there any problem with scoring-opic?
> > > Do you recommend that I use another scoring plugin (depth or link)?
> > >
> > > <property>
> > >   <name>plugin.includes</name>
> > >   <value>protocol-(http|httpclient)|urlfilter-(domain|regex|domainblacklist)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata|required)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|microformats-customtag|language-identifier|links-extractor|mimetype-filter|mimetype-alias-filter</value>
> > >   <description>Regular expression naming plugin directory names to
> > >   include.  Any plugin not matching this expression is excluded.
> > >   In any case you need at least include the nutch-extensionpoints plugin.
> > >   By default Nutch includes crawling just HTML and plain text via HTTP,
> > >   and basic indexing and search plugins. In order to use HTTPS please
> > >   enable protocol-httpclient, but be aware of possible intermittent
> > >   problems with the underlying commons-httpclient library.
> > >   </description>
> > > </property>
> > >
> > >
> > >
> > > ----- Original message -----
> > > From: "Julien Nioche" <lists.digitalpeb...@gmail.com>
> > > To: user@nutch.apache.org
> > > Sent: Wednesday, 20 May 2015 15:06:38
> > > Subject: [MASSMAIL]Re: about boost field extremely high
> > >
> > > Hi Eyeris
> > >
> > > The boost value is simply the output of what the ScoringFilters give for a
> > > document. Are you using OPIC?
> > >
> > > Julien
> > >
> > > On 20 May 2015 at 19:32, Eyeris RodrIguez Rueda <eru...@uci.cu> wrote:
> > >
> > > > Hi all.
> > > > I'm using Nutch 1.9 in local mode and Solr 4.10 with half a million
> > > > documents.
> > > > An adaptive fetch schedule is used to crawl pages that change
> > > > frequently.
> > > > I have detected that Nutch is calculating an extremely high boost for
> > > > some documents, so the document score in Solr is extremely high for
> > > > these documents and, as a consequence, the order of the results is
> > > > changed by this wrong boost.
> > > > This is a correct Solr output for me using the "cubadebate" query:
> > > > *******************************
> > > > {
> > > >   "responseHeader": {
> > > >     "status": 0,
> > > >     "QTime": 195
> > > >   },
> > > >   "response": {
> > > >     "numFound": 183486,
> > > >     "start": 0,
> > > >     "maxScore": 2.7115784,
> > > >     "docs": [
> > > >       {
> > > > >         "url": "http://www.cubadebate.cu/",
> > > >         "boost": 1.0175576,
> > > >         "score": 2.7115784
> > > >       },
> > > >       {
> > > > >         "url": "http://www.cubadebate.cu/editores/preguntas-frecuentes/",
> > > >         "boost": 0.11512774,
> > > >         "score": 0.59315777
> > > >       },
> > > >       {
> > > > >         "url": "http://www.cubadebate.cu/editores/",
> > > >         "boost": 0.16240995,
> > > >         "score": 0.50842094
> > > >       },
> > > >       {
> > > > >         "url": "http://www.cubadebate.cu/feed/",
> > > >         "boost": 0.8635264,
> > > >         "score": 0.42501986
> > > >       },
> > > >       {
> > > > >         "url": "http://www.cubadebate.cu/etiqueta/cine/",
> > > >         "boost": 0.13792185,
> > > >         "score": 0.3541832
> > > >       },
> > > >       {
> > > > >         "url": "http://www.cubadebate.cu/web2/",
> > > >         "boost": 0.114989564,
> > > >         "score": 0.3389473
> > > >       },
> > > >       {
> > > > >         "url": "http://www.cubadebate.cu/opinion/2015/03/06/diferencias-conciliables/",
> > > >         "boost": 0.18748672,
> > > >         "score": 0.28334656
> > > >       },
> > > >       {
> > > > >         "url": "http://www.cubadebate.cu/noticias/2015/02/02/freddy-asiel-voy-por-el-desquite/",
> > > >         "boost": 0.13997546,
> > > >         "score": 0.28334656
> > > >       },
> > > >       {
> > > > >         "url": "http://www.cubadebate.cu/especiales/2015/03/05/querido-hugo/",
> > > >         "boost": 0.13172969,
> > > >         "score": 0.28334656
> > > >       },
> > > >       {
> > > > >         "url": "http://www.cubadebate.cu/noticias/2015/02/08/grammys-la-lista-completa-de-los-ganadores/comment-page-1/",
> > > >         "boost": 0.12959023,
> > > >         "score": 0.24792825
> > > >       }
> > > >     ]
> > > >   },
> > > > ***********************************************
> > > > > This is an incorrect Solr output using the "cubadebate" query:
> > > > {
> > > >   "responseHeader": {
> > > >     "status": 0,
> > > >     "QTime": 111
> > > >   },
> > > >   "response": {
> > > >     "numFound": 172952,
> > > >     "start": 0,
> > > >     "maxScore": 22939964,
> > > >     "docs": [
> > > >       {
> > > > >         "url": "http://www.tvcubana.icrt.cu/seccion-temas/1088-yo-tambien-estoy-en-la-celac",
> > > >         "boost": 1422334460,
> > > >         "score": 22939964
> > > >       },
> > > >       {
> > > > >         "url": "http://www.perlavision.icrt.cu/index.php/deportes/boxeo/14065-domadores-de-cuba-enfrentaran-a-guerreros-de-mexico-en-semifinal-de-la-v-serie-mundial-de-boxeo",
> > > >         "boost": 1675646080,
> > > >         "score": 22476484
> > > >       },
> > > >       {
> > > > >         "url": "http://www.radiohc.cu/noticias/deportes/page/387",
> > > >         "boost": 1325039870,
> > > >         "score": 21191032
> > > >       },
> > > >       {
> > > > >         "url": "http://www.perlavision.icrt.cu/index.php/bloqueo/13922-nacera-en-mayo-engage-cuba-un-vigoroso-lobby-antibloqueo-en-congreso-de-eeuu",
> > > >         "boost": 1663792640,
> > > >         "score": 18730402
> > > >       },
> > > >       {
> > > > >         "url": "http://www.perlavision.icrt.cu/index.php/deportes/boxeo/14004-cuba-en-semifinales-de-serie-mundial-el-proximo-mes",
> > > >         "boost": 1528675840,
> > > >         "score": 18730402
> > > >       },
> > > >       {
> > > > >         "url": "http://www.radiohc.cu/noticias/ciencias/page/76",
> > > >         "boost": 1326217090,
> > > >         "score": 18542152
> > > >       },
> > > >       {
> > > > >         "url": "http://www.radiohc.cu/noticias/cultura/page/272",
> > > >         "boost": 1327128190,
> > > >         "score": 18542152
> > > >       },
> > > >       {
> > > > >         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/1060-beisbol-cubano-sera-el-tema-de-la-mesa-redonda-en-sus-emisiones-de-miercoles-y-jueves",
> > > >         "boost": 1424298370,
> > > >         "score": 18542152
> > > >       },
> > > >       {
> > > > >         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/1073-el-programa-nacional-de-medicamentos-en-la-mesa-redonda-miercoles-y-jueves",
> > > >         "boost": 1424231940,
> > > >         "score": 18542152
> > > >       },
> > > >       {
> > > > >         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/897-la-mesa-redonda-presentara-miercoles-y-jueves-las-cooerativas-no-agropecuarias-p",
> > > >         "boost": 1424386690,
> > > >         "score": 18542152
> > > >       }
> > > >     ]
> > > >   },
> > > >
> > > > > In this case the boost is extremely high.
> > > > > I have looked at the solrindexer plugin and found this at line 123:
> > > > >   inputDoc.setDocumentBoost(doc.getWeight());
> > > > >
> > > > > There is something similar in IndexerMapReduce.java
> > > > > (src/java/org/apache/nutch/indexer) at line 316. I think this
> > > > > increases the boost for all documents:
> > > > >  // apply boost to all indexed fields.
> > > > >     doc.setWeight(boost);
> > > > >
> > > > > I would really appreciate any advice or a solution for this problem.
> > > > > Thanks in advance.
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > > http://twitter.com/digitalpebble
> > >
> > >
> > 
> > 
> > 
> 
