Hello Jigal - the mention of OPIC gives it away. Check the Nutch records: they
must have a very high score, which is added to the NutchDocument as the boost
field. If you actually use that boost in Solr, this is what you get. Do not use
OPIC unless you have a reason to.
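
A rough sketch of why a large boost surfaces as a large fieldNorm, assuming the
classic Lucene 4.x norm (index-time boost times 1/sqrt(number of terms), before
the lossy one-byte encoding). The function names and the 100-term field length
are hypothetical and only illustrate the relationship; the real length of the
documents is not known here.

```python
import math


def field_norm(boost, num_terms):
    """Unencoded classic norm: index-time boost * 1/sqrt(term count)."""
    return boost * (1.0 / math.sqrt(num_terms))


def implied_boost(norm, num_terms):
    """Invert field_norm: the boost needed to reach a given norm."""
    return norm * math.sqrt(num_terms)


# Hypothetical 100-term content field, no boost applied.
print(field_norm(1.0, 100))

# Boost that the same hypothetical field would need to show fieldNorm=3072.0.
print(implied_boost(3072.0, 100))
```

Without a boost the norm can never exceed 1.0, so values such as 3072.0 or
1280.0 can only come from an index-time boost.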

Markus

 
 
-----Original message-----
> From:Jigal van Hemert | alterNET internet BV <ji...@alternet.nl>
> Sent: Tuesday 10th March 2015 9:32
> To: user <user@nutch.apache.org>
> Subject: Nutch documents have huge scores in Solr
> 
> Hi,
> 
> We use Nutch to index "external" sites along with a TYPO3 extension that
> sends the page content from the CMS to the same Solr server. The author of
> that extension has also made a configuration for Nutch with a few extra
> plugins which add some extra fields to make the data compatible with the
> documents that come from the CMS.
> 
> https://github.com/dkd/nutch-typo3-cms
> 
> This combination has worked fine for quite a few installations, but in one
> installation the Nutch documents always end up with the highest scores. A
> few days ago I heard from someone else that he had the same problem;
> completely different websites, only the TYPO3 CMS extension and the Nutch
> configuration are almost identical (apart from the site specific settings).
> 
> To rule out any boosting, query field and other settings I created some
> simple queries in the Solr 4.8.1 admin interface by just supplying a single
> search word. Below are some of the results for the word "afval" (Dutch for
> "garbage").
> What is remarkable is the huge difference in the fieldNorm values, which
> seems to be the cause of the extreme differences in scores (CMS content
> scored between 0 and 2.3; Nutch documents scored between 6000 and 150,000
> (rough numbers)).
> 
> I learned that the plugin "scoring-opic" is used to add scores to the Nutch
> documents. This seems to work fine in most cases.
> 
> Any pointers as to why this results in mega-scores are very much welcome.
> 
> List of the debugQuery output ("domain" is the placeholder of the actual
> domain name):
> 
> [... lots of nutch records skipped ...]
> <str name="c293c0a7c8d3311249d309c91f39e5e5b192b6c0/tx_nutch_external/
> https://domain/Loket/prodcat/products/getProductDetailsAction.do?name=Asbestverwijdering+bedrijfsmatig">
> 
> 14760.001 = (MATCH) sum of:
>   14760.001 = (MATCH) max of:
>     14760.001 = (MATCH) weight(content:afval^40.0 in 6617), product of:
>       0.99999994 = queryWeight(content:afval^40.0), product of:
>         40.0 = boost
>         4.804688 = idf(docFreq=168, maxDocs=7590)
>         0.0052032513 = queryNorm
>       14760.002 = (MATCH) fieldWeight(content:afval in 6617), product of:
>         1.0 = tf(termFreq(content:afval)=1)
>         4.804688 = idf(docFreq=168, maxDocs=7590)
>         3072.0 = fieldNorm(field=content, doc=6617)
> </str><str name="c293c0a7c8d3311249d309c91f39e5e5b192b6c0/tx_nutch_external/
> https://domain/Loket/knowledgebase/faqs/getFaqContentAction.do?id=725">
> 6150.0 = (MATCH) sum of:
>   6150.0 = (MATCH) max of:
>     6150.0 = (MATCH) weight(content:afval^40.0 in 5877), product of:
>       0.99999994 = queryWeight(content:afval^40.0), product of:
>         40.0 = boost
>         4.804688 = idf(docFreq=168, maxDocs=7590)
>         0.0052032513 = queryNorm
>       6150.0005 = (MATCH) fieldWeight(content:afval in 5877), product of:
>         1.0 = tf(termFreq(content:afval)=1)
>         4.804688 = idf(docFreq=168, maxDocs=7590)
>         1280.0 = fieldNorm(field=content, doc=5877)
> </str><str
> name="102b19e401862068820dd53b4a1beccb286f03a7/pages/27363/0/0/0">
> 2.1233919 = (MATCH) sum of:
>   2.1233919 = (MATCH) max of:
>     2.1233919 = (MATCH) weight(content:afval^40.0 in 493), product of:
>       0.99999994 = queryWeight(content:afval^40.0), product of:
>         40.0 = boost
>         4.804688 = idf(docFreq=168, maxDocs=7590)
>         0.0052032513 = queryNorm
>       2.123392 = (MATCH) fieldWeight(content:afval in 493), product of:
>         1.4142135 = tf(termFreq(content:afval)=2)
>         4.804688 = idf(docFreq=168, maxDocs=7590)
>         0.3125 = fieldNorm(field=content, doc=493)
>     1.1733533 = (MATCH) weight(title:afval^5.0 in 493), product of:
>       0.17471766 = queryWeight(title:afval^5.0), product of:
>         5.0 = boost
>         6.715711 = idf(docFreq=24, maxDocs=7590)
>         0.0052032513 = queryNorm
>       6.715711 = (MATCH) fieldWeight(title:afval in 493), product of:
>         1.0 = tf(termFreq(title:afval)=1)
>         6.715711 = idf(docFreq=24, maxDocs=7590)
>         1.0 = fieldNorm(field=title, doc=493)
>     1.500486 = (MATCH) weight(tagsH2H3:afval^3.0 in 493), product of:
>       0.11628768 = queryWeight(tagsH2H3:afval^3.0), product of:
>         3.0 = boost
>         7.4496803 = idf(docFreq=11, maxDocs=7590)
>         0.0052032513 = queryNorm
>       12.903225 = (MATCH) fieldWeight(tagsH2H3:afval in 493), product of:
>         1.7320508 = tf(termFreq(tagsH2H3:afval)=3)
>         7.4496803 = idf(docFreq=11, maxDocs=7590)
>         1.0 = fieldNorm(field=tagsH2H3, doc=493)
> </str><str
> name="102b19e401862068820dd53b4a1beccb286f03a7/pages/7844/0/0/0">
> 1.7667065 = (MATCH) sum of:
>   1.7667065 = (MATCH) max of:
>     1.1917508 = (MATCH) weight(content:afval^40.0 in 3750), product of:
>       0.99999994 = queryWeight(content:afval^40.0), product of:
>         40.0 = boost
>         4.804688 = idf(docFreq=168, maxDocs=7590)
>         0.0052032513 = queryNorm
>       1.1917509 = (MATCH) fieldWeight(content:afval in 3750), product of:
>         2.6457512 = tf(termFreq(content:afval)=7)
>         4.804688 = idf(docFreq=168, maxDocs=7590)
>         0.09375 = fieldNorm(field=content, doc=3750)
>     1.1733533 = (MATCH) weight(title:afval^5.0 in 3750), product of:
>       0.17471766 = queryWeight(title:afval^5.0), product of:
>         5.0 = boost
>         6.715711 = idf(docFreq=24, maxDocs=7590)
>         0.0052032513 = queryNorm
>       6.715711 = (MATCH) fieldWeight(title:afval in 3750), product of:
>         1.0 = tf(termFreq(title:afval)=1)
>         6.715711 = idf(docFreq=24, maxDocs=7590)
>         1.0 = fieldNorm(field=title, doc=3750)
>     1.7667065 = (MATCH) weight(keywords:afval^2.0 in 3750), product of:
>       0.08663568 = queryWeight(keywords:afval^2.0), product of:
>         2.0 = boost
>         8.325149 = idf(docFreq=4, maxDocs=7590)
>         0.0052032513 = queryNorm
>       20.392366 = (MATCH) fieldWeight(keywords:afval in 3750), product of:
>         2.4494898 = tf(termFreq(keywords:afval)=6)
>         8.325149 = idf(docFreq=4, maxDocs=7590)
>         1.0 = fieldNorm(field=keywords, doc=3750)
>     1.500486 = (MATCH) weight(tagsH2H3:afval^3.0 in 3750), product of:
>       0.11628768 = queryWeight(tagsH2H3:afval^3.0), product of:
>         3.0 = boost
>         7.4496803 = idf(docFreq=11, maxDocs=7590)
>         0.0052032513 = queryNorm
>       12.903225 = (MATCH) fieldWeight(tagsH2H3:afval in 3750), product of:
>         1.7320508 = tf(termFreq(tagsH2H3:afval)=3)
>         7.4496803 = idf(docFreq=11, maxDocs=7590)
>         1.0 = fieldNorm(field=tagsH2H3, doc=3750)
> </str>
> [... lots of page documents skipped ...]
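
The figures in the explain output above can be checked by hand: each
fieldWeight is tf * idf * fieldNorm, and the clause score is queryWeight *
fieldWeight. A small sketch using only the values quoted above (the helper name
clause_score is made up for illustration):

```python
def clause_score(boost, idf, query_norm, tf, field_norm):
    """Recompute one weight(...) clause from a Solr explain tree."""
    query_weight = boost * idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


IDF = 4.804688           # idf(docFreq=168, maxDocs=7590)
QUERY_NORM = 0.0052032513

# content:afval in doc 6617 (Nutch record): tf=1.0, fieldNorm=3072.0
nutch = clause_score(40.0, IDF, QUERY_NORM, 1.0, 3072.0)

# content:afval in doc 493 (CMS page): tf=1.4142135, fieldNorm=0.3125
cms = clause_score(40.0, IDF, QUERY_NORM, 1.4142135, 0.3125)

print(nutch, cms, nutch / cms)
```

Boost, idf and queryNorm are identical for both documents and tf differs only
slightly, so the gap between the two scores comes from fieldNorm.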
> 
> 
> -- 
> 
> 
> Kind regards,
> 
> 
> Jigal van Hemert | Developer
> 
> 
> 
> Langesteijn 124
> 3342LG Hendrik-Ido-Ambacht
> 
> T. +31 (0)78 635 1200
> F. +31 (0)848 34 9697
> KvK. 23 09 28 65
> 
> ji...@alternet.nl
> www.alternet.nl
> 
> 
