Hi all, I was trying out the MoreLikeThis support, and getting some odd results.
I realized that unless the fields used for the similarity calculation have stored term vectors, the MoreLikeThis code from Lucene re-analyzes the field using the StandardAnalyzer, which in my case is quite different from the analyzer I'm using in the Solr schema.
So the first note is for anybody using MoreLikeThis: make sure you also specify termVectors=true in the Solr schema for any fields passed to the query in the mlt.fl parameter.
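As a rough sketch of what I mean, a field definition along these lines should do it. The field name "description" and the type "text" are just placeholders I made up for illustration; substitute whatever fields you actually pass in mlt.fl.

```xml
<!-- Hypothetical field: "description" and "text" are placeholder names.
     termVectors="true" stores term vectors at index time, so MoreLikeThis
     can read them instead of re-analyzing the stored text. -->
<field name="description" type="text" indexed="true" stored="true"
       termVectors="true"/>
```

With that in place, a request using mlt.fl=description would pick up the stored term vectors for that field. Note that the schema change only affects documents indexed after it, so existing documents need to be re-indexed.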
The second note is that the Wiki page and the example schema might want to include some reference to the termVectors field attribute. For example, the sample schema says:
<!-- Valid attributes for fields:
  name: mandatory - the name for the field
  type: mandatory - the name of a previously defined type from the <types> section
  indexed: true if this field should be indexed (searchable or sortable)
  stored: true if this field should be retrievable
  compressed: [false] if this field should be stored using gzip compression
    (this will only apply if the field type is compressable; among
    the standard field types, only TextField and StrField are)
  multiValued: true if this field may contain multiple values per document
  omitNorms: (expert) set to true to omit the norms associated with
    this field (this disables length normalization and index-time
    boosting for the field, and saves some memory). Only full-text
    fields or fields that need an index-time boost need norms.
-->
This initially made me think these were the only valid attributes for fields. Likewise, the wiki page at http://wiki.apache.org/solr/SchemaXml doesn't mention termVectors, termPositions, or termOffsets. I would edit that page, but there currently isn't a section that covers all the attributes, only the common ones.
Thanks,

-- Ken

--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
