Andrzej Bialecki wrote:
* the DefaultSimilarity seems to excessively favor small lengths of "content" (high tf) and anchor texts (too high boost value?).
NutchSimilarity.lengthNorm() penalizes short content by normalizing every document with fewer than 1000 content tokens as though it had 1000 content tokens. Is this not sufficient?
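For reference, the floor described above can be sketched roughly as follows. This is not the actual NutchSimilarity source: the class name and the MIN_CONTENT_LENGTH constant are made up for illustration, and the sketch assumes the usual Lucene default of 1/sqrt(numTokens) for the length norm.

```java
// Sketch only; not the actual NutchSimilarity source.
class LengthNormSketch {
    // Hypothetical constant: the 1000-token floor mentioned in the thread.
    static final int MIN_CONTENT_LENGTH = 1000;

    // Assumes Lucene's default length norm of 1/sqrt(numTokens).
    // Short "content" fields are treated as if they had MIN_CONTENT_LENGTH tokens;
    // all other fields are left alone.
    static float lengthNorm(String fieldName, int numTokens) {
        if ("content".equals(fieldName)) {
            numTokens = Math.max(numTokens, MIN_CONTENT_LENGTH);
        }
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        // A 50-token page and a 1000-token page get the same content norm.
        System.out.println(lengthNorm("content", 50));
        System.out.println(lengthNorm("content", 1000));
        // Other fields, such as anchor, are not floored.
        System.out.println(lengthNorm("anchor", 4));
    }
}
```

Note that a floor like this only caps the norm of the content field; it does nothing for short url or anchor fields.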
Not in my experience. Please consider the following hits (attached in a file), ordered by score, which I got from an index of 5 million pages, mostly from Swedish sites, for the query "apoteket" ("the pharmacy" in Swedish). There is clearly something very wrong with the second hit.
I don't object to indexing titles in a separate field. They can be high quality, but they can also be spammed more easily than anchors. In any case, separately controlling their boost, length normalization, etc. is probably a good idea.
Ok, I'll prepare a patch for review.
* for the url field it makes a difference whether the query terms occur in the domain name or in the file path of the url. The former is usually more important, because it's more likely to point to a reference site, and IMHO should be boosted separately. The latter usually indicates a reference page. We could differentiate between the two by adding a "domain" field as an unstored, tokenized and indexed field, and modifying the BasicQueryFilter to use this field in order to boost reference sites.
I'm not exactly sure how you'd use this. Why not just boost pages at reference sites? That does not require a new field.
Example: all other things being equal (i.e. the content and anchors), which url seems to be more representative for the query "ikea":
http://www.ikea.se/something/else.html
http://www.something.se/else/ikea.html
IMHO the first url should be given a higher score. Currently they get the same score.
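To make the distinction concrete, here is a minimal sketch of how a url could be split into "domain" tokens and path tokens before indexing. The class and method names are hypothetical, and the tokenization (lowercase, split on non-alphanumerics) is an assumption, not what Nutch's analyzers do.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only; names and tokenization rules are illustrative assumptions.
class UrlPartsSketch {
    // Lowercase the input and split it on runs of non-alphanumeric characters.
    static List<String> tokens(String s) {
        List<String> out = new ArrayList<>();
        if (s == null) {
            return out;
        }
        for (String t : s.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Tokens that would go into the proposed "domain" field.
    static List<String> domainTokens(String url) {
        return tokens(URI.create(url).getHost());
    }

    // Tokens from the path part of the url.
    static List<String> pathTokens(String url) {
        return tokens(URI.create(url).getPath());
    }

    public static void main(String[] args) {
        String a = "http://www.ikea.se/something/else.html";
        String b = "http://www.something.se/else/ikea.html";
        System.out.println(domainTokens(a) + " / " + pathTokens(a));
        System.out.println(domainTokens(b) + " / " + pathTokens(b));
    }
}
```

With a split like this, "ikea" matches the domain tokens of the first url only, so a query clause on the "domain" field with its own boost would separate the two cases.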
Also, to offer more flexibility in searching I would propose to index the values of primaryType and secondaryType. This would enable searching for content of specific mime type. Currently these fields are only stored, but not indexed.
I think John recently added a plugin that does this, right?
Right. I wrote this email a few days ago, and John is simply too fast... ;-)
--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
--------------------------------------------------------------------------------------
1. [sv] Apoteket.se - Apoteket AB:s webbplats
   Apoteket.se - Apoteket AB:s webbplats Apoteket.se är uppbyggd med ramar ...
   http://www.apoteket.se/ - 4kB - cached - explain

anchors.jsp:
page: http://www.apoteket.se/
incoming anchor text:
  * APOTEKET
  (190 lines more, like the above ...)
  * www.patientpass.se www.apoteket.se
  * www.apoteket.se

explain.jsp:
  * docNo = 3a880
  * segment = 20041025065231-0
  * digest = a4876ffa6a148e1d08b4fdb633d1469a
  * boost = 9.475205
  * contentLength = 4234
  * primaryType = text
  * subType = html
  * lang = sv
  * url = http://www.apoteket.se/
  * title = Apoteket.se - Apoteket AB:s webbplats

score for query: apoteket
  * 123.389946 = sum of:
    o 14.722097 = weight(url:apoteket^4.0 in 2163450), product of:
      + 0.892789 = queryWeight(url:apoteket^4.0), product of:
        # 4.0 = boost
        # 13.192006 = idf(docFreq=26)
        # 0.016919129 = queryNorm
      + 16.490007 = fieldWeight(url:apoteket in 2163450), product of:
        # 1.0 = tf(termFreq(url:apoteket)=1)
        # 13.192006 = idf(docFreq=26)
        # 1.25 = fieldNorm(field=url, doc=2163450)
    o 108.1546 = weight(anchor:apoteket^2.0 in 2163450), product of:
      + 0.42763758 = queryWeight(anchor:apoteket^2.0), product of:
        # 2.0 = boost
        # 12.637695 = idf(docFreq=46)
        # 0.016919129 = queryNorm
      + 252.91183 = fieldWeight(anchor:apoteket in 2163450), product of:
        # 13.341664 = tf(termFreq(anchor:apoteket)=178)
        # 12.637695 = idf(docFreq=46)
        # 1.5 = fieldNorm(field=anchor, doc=2163450)
    o 0.5132425 = weight(content:apoteket in 2163450), product of:
      + 0.14161198 = queryWeight(content:apoteket), product of:
        # 8.369933 = idf(docFreq=3353)
        # 0.016919129 = queryNorm
      + 3.6242874 = fieldWeight(content:apoteket in 2163450), product of:
        # 1.7320508 = tf(termFreq(content:apoteket)=3)
        # 8.369933 = idf(docFreq=3353)
        # 0.25 = fieldNorm(field=content, doc=2163450)
--------------------------------------------------------------------------------------
2. [nl] http://shorl.com/hitrehynavisti
   http://shorl.com/hitrehynavisti - cached - explain

anchors.jsp:
page: http://shorl.com/hitrehynavisti
incoming anchor text:
  * Apoteket

explain.jsp:
  * docNo = 609d
  * segment = 20050102021321
  * digest = 12ef9b61e46cae1f98ac9fa3bc3efff9
  * boost = 1.3132616
  * primaryType = text
  * subType = html
  * lang = nl
  * url = http://shorl.com/hitrehynavisti
  * title =

score for query: apoteket
  * 4.7288094 = weight(anchor:apoteket^2.0 in 5105496), product of:
    o 0.42763758 = queryWeight(anchor:apoteket^2.0), product of:
      + 2.0 = boost
      + 12.637695 = idf(docFreq=46)
      + 0.016919129 = queryNorm
    o 11.057983 = fieldWeight(anchor:apoteket in 5105496), product of:
      + 1.0 = tf(termFreq(anchor:apoteket)=1)
      + 12.637695 = idf(docFreq=46)
      + 0.875 = fieldNorm(field=anchor, doc=5105496)
--------------------------------------------------------------------------------------
3. [sv] Apoteket.se - Apoteket AB:s webbplats
   Apoteket.se - Apoteket AB:s webbplats Apoteket.se är uppbyggd med ramar ...
   http://www.apoteket.com/egenvard/vitaminer/b_tiamin.html - 4kB - cached - explain

anchors.jsp:
page: http://www.apoteket.com/egenvard/vitaminer/b_tiamin.html
incoming anchor text:

explain.jsp:
  * docNo = 107f8
  * segment = 20041228151638
  * digest = a4876ffa6a148e1d08b4fdb633d1469a
  * boost = 1.0
  * contentLength = 4234
  * primaryType = text
  * subType = html
  * lang = sv
  * url = http://www.apoteket.com/egenvard/vitaminer/b_tiamin.html
  * title = Apoteket.se - Apoteket AB:s webbplats

score for query: apoteket
  * 4.144033 = sum of:
    o 0.7361049 = weight(url:apoteket^4.0 in 4748338), product of:
      + 0.892789 = queryWeight(url:apoteket^4.0), product of:
        # 4.0 = boost
        # 13.192006 = idf(docFreq=26)
        # 0.016919129 = queryNorm
      + 0.8245004 = fieldWeight(url:apoteket in 4748338), product of:
        # 1.0 = tf(termFreq(url:apoteket)=1)
        # 13.192006 = idf(docFreq=26)
        # 0.0625 = fieldNorm(field=url, doc=4748338)
    o 3.3437731 = weight(anchor:apoteket^2.0 in 4748338), product of:
      + 0.42763758 = queryWeight(anchor:apoteket^2.0), product of:
        # 2.0 = boost
        # 12.637695 = idf(docFreq=46)
        # 0.016919129 = queryNorm
      + 7.8191752 = fieldWeight(anchor:apoteket in 4748338), product of:
        # 1.4142135 = tf(termFreq(anchor:apoteket)=2)
        # 12.637695 = idf(docFreq=46)
        # 0.4375 = fieldNorm(field=anchor, doc=4748338)
    o 0.06415531 = weight(content:apoteket in 4748338), product of:
      + 0.14161198 = queryWeight(content:apoteket), product of:
        # 8.369933 = idf(docFreq=3353)
        # 0.016919129 = queryNorm
      + 0.45303592 = fieldWeight(content:apoteket in 4748338), product of:
        # 1.7320508 = tf(termFreq(content:apoteket)=3)
        # 8.369933 = idf(docFreq=3353)
        # 0.03125 = fieldNorm(field=content, doc=4748338)
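The arithmetic behind these explanations can be reproduced with a small sketch. It only mirrors what the explain output itself shows (weight = queryWeight * fieldWeight, with tf equal to the square root of termFreq, as the tf values above suggest); the class and method names are made up for illustration and are not Lucene API.

```java
// Sketch only; mirrors the structure of the explain output, not Lucene's classes.
class ExplainArithmetic {
    // queryWeight = boost * idf * queryNorm
    static double queryWeight(double boost, double idf, double queryNorm) {
        return boost * idf * queryNorm;
    }

    // fieldWeight = tf * idf * fieldNorm, with tf = sqrt(termFreq)
    static double fieldWeight(int termFreq, double idf, double fieldNorm) {
        return Math.sqrt(termFreq) * idf * fieldNorm;
    }

    // weight of one field clause = queryWeight * fieldWeight
    static double weight(double boost, double idf, double queryNorm,
                         int termFreq, double fieldNorm) {
        return queryWeight(boost, idf, queryNorm) * fieldWeight(termFreq, idf, fieldNorm);
    }

    public static void main(String[] args) {
        double queryNorm = 0.016919129;
        double anchorIdf = 12.637695;
        // Hit 2: a single incoming anchor, fieldNorm 0.875.
        System.out.println(weight(2.0, anchorIdf, queryNorm, 1, 0.875));
        // Hit 1: anchor clause with termFreq 178, fieldNorm 1.5.
        System.out.println(weight(2.0, anchorIdf, queryNorm, 178, 1.5));
    }
}
```

Both results agree with the reported values (4.7288094 and 108.1546) up to float rounding. Because idf enters both queryWeight and fieldWeight, a rare anchor term contributes idf squared, which is why a single anchor match is enough to lift hit 2 above hit 3.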
