This field is our main text field. In the standard case, length normalization does significant work together with tf-idf, so we don't want to omit norms for it.
Removing duplicates won't help here, because the terms are not duplicates: one term is the stem and the other is the original, unstemmed word.

On Fri, Dec 6, 2013 at 9:48 AM, Ahmet Arslan <iori...@yahoo.com> wrote:

> Hi Isaac,
>
> Did you consider omitting norms completely for that field? omitNorms="true"
> Are you using solr.RemoveDuplicatesTokenFilterFactory?
>
> On Thursday, December 5, 2013 8:55 PM, Isaac Hebsh <isaac.he...@gmail.com>
> wrote:
>
> Hi,
> we implemented a morphological analyzer, which stems words at index time.
> For some reasons, we index both the original word and the stem (at the same
> position, of course).
> The stemming is done for a specific language, so other languages are not
> stemmed at all.
>
> Because of that, two documents with the same number of terms may have
> different termVector sizes. A document that contains many words that get
> stemmed will have a double-sized termVector. This behaviour affects the
> relevance score in a BAD way: the fieldNorm of these documents reduces
> their score. This is NOT the wanted behaviour in our case.
>
> We are looking for a way to "mark" the stemmed words (at index time, of
> course) so they won't affect the fieldNorm. Does such a way exist?
>
> Do you have another idea?
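
To make the effect in the original post concrete, here is a minimal numeric sketch. It assumes the classic length norm of 1/sqrt(number of terms) and ignores index-time boosts and the lossy byte encoding of the stored norm. The helper names (`length_norm`, `count_terms`) and the `discount_overlaps` flag are hypothetical and exist only for this illustration; they are not Solr or Lucene API calls. Tokens are modelled as (term, position increment) pairs, with a stem indexed at the same position as its original word having increment 0.

```python
import math

def length_norm(num_terms):
    # Classic length normalization: longer fields get a smaller norm.
    return 1.0 / math.sqrt(num_terms)

def count_terms(tokens, discount_overlaps):
    # tokens: list of (term, position_increment) pairs.
    # With discount_overlaps, tokens stacked on an existing position
    # (increment 0) are left out of the field length.
    if discount_overlaps:
        return sum(1 for _, inc in tokens if inc > 0)
    return len(tokens)

# A 100-word document in a language that is not stemmed.
plain = [("w%d" % i, 1) for i in range(100)]

# A 100-word document where every word also gets a stem at the same position.
stemmed = []
for i in range(100):
    stemmed.append(("w%d" % i, 1))  # original word
    stemmed.append(("s%d" % i, 0))  # its stem, same position

norm_plain = length_norm(count_terms(plain, False))
norm_stemmed = length_norm(count_terms(stemmed, False))
norm_discounted = length_norm(count_terms(stemmed, True))

print(norm_plain, norm_stemmed, norm_discounted)
```

Under these assumptions, counting every token cuts the norm of the stemmed document by a factor of 1/sqrt(2) relative to the unstemmed one, while leaving out the increment-0 tokens gives both documents the same norm. That second case is the behaviour the original post asks for.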