Thanks to all for your quick replies. Is there any description of how scoring-link works? I was reading the source code but I don't understand it at all.
Markus, are you suggesting that I use the scoring-link plugin? Is this Nutch's LinkRank or not?
I really appreciate your help.

----- Original message -----
From: "Markus Jelsma" <markus.jel...@openindex.io>
To: user@nutch.apache.org
Sent: Wednesday, 20 May 2015 16:53:26
Subject: RE: [MASSMAIL]Re: about boost field extremely high

Yes indeed. But it also makes sense to rely on Lucene's scoring algorithms and custom boosting functions. The problem with generic document boosting is that it can negatively influence your result sets, putting non-relevant but highly scored documents on top. Another alternative is to use Nutch's LinkRank; it is batch oriented but much more powerful.

-----Original message-----
> From:Julien Nioche <lists.digitalpeb...@gmail.com>
> Sent: Wednesday 20th May 2015 22:10
> To: user@nutch.apache.org
> Subject: Re: [MASSMAIL]Re: about boost field extremely high
>
> See https://issues.apache.org/jira/browse/NUTCH-1958 and the reference to a
> related discussion. The choice of scoring depends on the nature of your
> crawl, you can also not use a scoring filter at all in which case all the
> docs will get a boost of 1
>
>
> On 20 May 2015 at 20:55, Eyeris RodrIguez Rueda <eru...@uci.cu> wrote:
>
> > Yes Julien.
> > I'm using only scoring-opic. This is my plugin.includes property.
> > I have attached my nutch-site.xml.
> > Is there any problem with scoring-opic?
> > Do you recommend that I use another scoring plugin (depth or link)?
> >
> > <property>
> > <name>plugin.includes</name>
> >
> > <value>protocol-(http|httpclient)|urlfilter-(domain|regex|domainblacklist)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata|required)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|microformats-customtag|language-identifier|links-extractor|mimetype-filter|mimetype-alias-filter</value>
> > <description>Regular expression naming plugin directory names to
> > include. Any plugin not matching this expression is excluded.
> > In any case you need at least include the nutch-extensionpoints plugin. By
> > default Nutch includes crawling just HTML and plain text via HTTP,
> > and basic indexing and search plugins. In order to use HTTPS please enable
> > protocol-httpclient, but be aware of possible intermittent problems with the
> > underlying commons-httpclient library.
> > </description>
> > </property>
> >
> >
> >
> > ----- Original message -----
> > From: "Julien Nioche" <lists.digitalpeb...@gmail.com>
> > To: user@nutch.apache.org
> > Sent: Wednesday, 20 May 2015 15:06:38
> > Subject: [MASSMAIL]Re: about boost field extremely high
> >
> > Hi Eyeris
> >
> > The boost value is simply the output of what the ScoringFilters give for a
> > document. Are you using OPIC?
> >
> > Julien
> >
> > On 20 May 2015 at 19:32, Eyeris RodrIguez Rueda <eru...@uci.cu> wrote:
> >
> > > Hi all.
> > > I'm using Nutch 1.9 in local mode and Solr 4.10 with half a million
> > > documents.
> > > An adaptive fetch schedule is being used to crawl pages that change
> > > frequently.
> > > I have detected that Nutch is calculating an extremely high boost for some
> > > documents, and the document score in Solr is extremely high for these
> > > documents. As a consequence, the order of the results is changed by this
> > > wrong boost.
> > > This is a correct Solr output for me using the "cubadebate" query:
> > > *******************************
> > > {
> > >   "responseHeader": {
> > >     "status": 0,
> > >     "QTime": 195
> > >   },
> > >   "response": {
> > >     "numFound": 183486,
> > >     "start": 0,
> > >     "maxScore": 2.7115784,
> > >     "docs": [
> > >       {
> > >         "url": "http://www.cubadebate.cu/",
> > >         "boost": 1.0175576,
> > >         "score": 2.7115784
> > >       },
> > >       {
> > >         "url": "http://www.cubadebate.cu/editores/preguntas-frecuentes/",
> > >         "boost": 0.11512774,
> > >         "score": 0.59315777
> > >       },
> > >       {
> > >         "url": "http://www.cubadebate.cu/editores/",
> > >         "boost": 0.16240995,
> > >         "score": 0.50842094
> > >       },
> > >       {
> > >         "url": "http://www.cubadebate.cu/feed/",
> > >         "boost": 0.8635264,
> > >         "score": 0.42501986
> > >       },
> > >       {
> > >         "url": "http://www.cubadebate.cu/etiqueta/cine/",
> > >         "boost": 0.13792185,
> > >         "score": 0.3541832
> > >       },
> > >       {
> > >         "url": "http://www.cubadebate.cu/web2/",
> > >         "boost": 0.114989564,
> > >         "score": 0.3389473
> > >       },
> > >       {
> > >         "url": "http://www.cubadebate.cu/opinion/2015/03/06/diferencias-conciliables/",
> > >         "boost": 0.18748672,
> > >         "score": 0.28334656
> > >       },
> > >       {
> > >         "url": "http://www.cubadebate.cu/noticias/2015/02/02/freddy-asiel-voy-por-el-desquite/",
> > >         "boost": 0.13997546,
> > >         "score": 0.28334656
> > >       },
> > >       {
> > >         "url": "http://www.cubadebate.cu/especiales/2015/03/05/querido-hugo/",
> > >         "boost": 0.13172969,
> > >         "score": 0.28334656
> > >       },
> > >       {
> > >         "url": "http://www.cubadebate.cu/noticias/2015/02/08/grammys-la-lista-completa-de-los-ganadores/comment-page-1/",
> > >         "boost": 0.12959023,
> > >         "score": 0.24792825
> > >       }
> > >     ]
> > >   },
> > > ***********************************************
> > > This is an incorrect Solr output using the "cubadebate" query:
> > > {
> > >   "responseHeader": {
> > >     "status": 0,
> > >     "QTime": 111
> > >   },
> > >   "response": {
> > >     "numFound": 172952,
> > >     "start": 0,
> > >     "maxScore": 22939964,
> > >     "docs": [
> > >       {
> > >         "url": "http://www.tvcubana.icrt.cu/seccion-temas/1088-yo-tambien-estoy-en-la-celac",
> > >         "boost": 1422334460,
> > >         "score": 22939964
> > >       },
> > >       {
> > >         "url": "http://www.perlavision.icrt.cu/index.php/deportes/boxeo/14065-domadores-de-cuba-enfrentaran-a-guerreros-de-mexico-en-semifinal-de-la-v-serie-mundial-de-boxeo",
> > >         "boost": 1675646080,
> > >         "score": 22476484
> > >       },
> > >       {
> > >         "url": "http://www.radiohc.cu/noticias/deportes/page/387",
> > >         "boost": 1325039870,
> > >         "score": 21191032
> > >       },
> > >       {
> > >         "url": "http://www.perlavision.icrt.cu/index.php/bloqueo/13922-nacera-en-mayo-engage-cuba-un-vigoroso-lobby-antibloqueo-en-congreso-de-eeuu",
> > >         "boost": 1663792640,
> > >         "score": 18730402
> > >       },
> > >       {
> > >         "url": "http://www.perlavision.icrt.cu/index.php/deportes/boxeo/14004-cuba-en-semifinales-de-serie-mundial-el-proximo-mes",
> > >         "boost": 1528675840,
> > >         "score": 18730402
> > >       },
> > >       {
> > >         "url": "http://www.radiohc.cu/noticias/ciencias/page/76",
> > >         "boost": 1326217090,
> > >         "score": 18542152
> > >       },
> > >       {
> > >         "url": "http://www.radiohc.cu/noticias/cultura/page/272",
> > >         "boost": 1327128190,
> > >         "score": 18542152
> > >       },
> > >       {
> > >         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/1060-beisbol-cubano-sera-el-tema-de-la-mesa-redonda-en-sus-emisiones-de-miercoles-y-jueves",
> > >         "boost": 1424298370,
> > >         "score": 18542152
> > >       },
> > >       {
> > >         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/1073-el-programa-nacional-de-medicamentos-en-la-mesa-redonda-miercoles-y-jueves",
> > >         "boost": 1424231940,
> > >         "score": 18542152
> > >       },
> > >       {
> > >         "url": "http://www.tvcubana.icrt.cu/archivo/118-archiv0/897-la-mesa-redonda-presentara-miercoles-y-jueves-las-cooerativas-no-agropecuarias-p",
> > >         "boost": 1424386690,
> > >         "score": 18542152
> > >       }
> > >     ]
> > >   },
> > >
> > > In this case the boost is extremely high.
> > > I have looked at the solrindexer plugin and I have seen this at line 123:
> > > inputDoc.setDocumentBoost(doc.getWeight());
> > >
> > > In IndexerMapReduce.java (src/java/org/apache/nutch/indexer), at line 316,
> > > there is something similar, which I think increases the boost for all
> > > documents:
> > > // apply boost to all indexed fields.
> > > doc.setWeight(boost);
> > >
> > > Please, I would really appreciate any advice or solution for this problem.
> > > Thanks in advance.
> > >
> > >
> > >
> >
> >
> > --
> >
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
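
To see which indexed documents carry the inflated boost described in this thread, a small script along these lines might help. It is only a sketch: the function name `find_suspicious_boosts` and the threshold of 100 are assumptions made for illustration (the healthy response above shows boosts at or below roughly 1), and the sample data is trimmed from the two Solr responses quoted in the thread. It only reads the JSON that Solr returns; it does not touch Nutch or fix the scoring itself.

```python
import json

# Assumed threshold, not a Nutch or Solr default: boosts in the healthy
# response quoted in this thread are all close to or below 1.
MAX_PLAUSIBLE_BOOST = 100.0


def find_suspicious_boosts(solr_response, max_boost=MAX_PLAUSIBLE_BOOST):
    """Return (url, boost) pairs whose boost exceeds max_boost, highest first."""
    docs = solr_response.get("response", {}).get("docs", [])
    flagged = [
        (doc["url"], doc["boost"])
        for doc in docs
        if doc.get("boost", 0.0) > max_boost
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Sample combined and trimmed from the two responses quoted in this thread.
sample = json.loads("""
{
  "response": {
    "docs": [
      {"url": "http://www.cubadebate.cu/",
       "boost": 1.0175576, "score": 2.7115784},
      {"url": "http://www.radiohc.cu/noticias/deportes/page/387",
       "boost": 1325039870, "score": 21191032},
      {"url": "http://www.radiohc.cu/noticias/ciencias/page/76",
       "boost": 1326217090, "score": 18542152}
    ]
  }
}
""")

suspicious = find_suspicious_boosts(sample)
for url, boost in suspicious:
    # Print the boost first so the outliers line up in a column.
    print(f"{boost:>14.0f}  {url}")
```

Running the same function over a full query response (requesting only the url, boost and score fields) would give a list of URLs to inspect in the crawl database before deciding whether to change the scoring plugin.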