This field is our main text field. In the standard case, length
normalization contributes significantly to tf-idf scoring, so we don't want to
disable it.

Removing duplicates won't help here, because the terms are not duplicates:
one is the stemmed form and the other is the original word.
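
To make the effect concrete, here is a minimal sketch in Python (not Solr
code). It assumes the classic length norm of 1/sqrt(number of tokens) and
ignores the lossy quantization Lucene applies when it stores norms. The
analyze() helper, the stem table and the discount_overlaps flag are
hypothetical names used only for this illustration; the flag models what we
would like "marking" to do, namely leaving tokens with position increment 0
out of the count.

```python
import math

def analyze(words, stems):
    """Hypothetical analyzer: emit each word with position increment 1,
    plus its stem at the same position (increment 0) when a stem exists."""
    tokens = []
    for word in words:
        tokens.append((word, 1))
        if word in stems:
            tokens.append((stems[word], 0))
    return tokens

def length_norm(tokens, discount_overlaps=False):
    """Classic 1/sqrt(numTerms) length norm (assumed formula, no boost).
    With discount_overlaps=True, tokens stacked on an existing position
    (position increment 0) are not counted."""
    if discount_overlaps:
        count = sum(1 for _, pos_inc in tokens if pos_inc > 0)
    else:
        count = len(tokens)
    return 1.0 / math.sqrt(count) if count else 0.0

# Illustrative stem table and two four-word documents.
stems = {"running": "run", "walked": "walk", "cats": "cat", "houses": "house"}
stemmed_doc = analyze(["running", "walked", "cats", "houses"], stems)
unstemmed_doc = analyze(["alpha", "beta", "gamma", "delta"], stems)

print(length_norm(unstemmed_doc))                        # 4 tokens
print(length_norm(stemmed_doc))                          # 8 tokens, lower norm
print(length_norm(stemmed_doc, discount_overlaps=True))  # 4 counted tokens
```

Both documents have four words, but the stemmed one is indexed as eight
tokens and gets a lower norm unless the stacked stems are left out of the
count.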


On Fri, Dec 6, 2013 at 9:48 AM, Ahmet Arslan <iori...@yahoo.com> wrote:

> Hi Isaac,
>
> Did you consider omitting norms completely for that field? omitNorms="true"
> Are you using solr.RemoveDuplicatesTokenFilterFactory?
>
>
>
> On Thursday, December 5, 2013 8:55 PM, Isaac Hebsh <isaac.he...@gmail.com>
> wrote:
>
> Hi,
> we implemented a morphological analyzer that stems words at index time.
> For various reasons, we index both the original word and the stem (at the
> same position, of course).
> Stemming is done for one specific language, so words in other languages are
> not stemmed at all.
>
> Because of that, two documents with the same number of words may have
> different termVector sizes. A document containing many words that get
> stemmed will have a termVector up to twice as large. This behaviour affects
> the relevance score in a BAD way: the fieldNorm of these documents reduces
> their score. This is NOT the behaviour we want in our case.
>
> We are looking for a way to "mark" the stemmed words (at index time, of
> course) so they won't affect the fieldNorm. Does such a way exist?
>
> Do you have another idea?
>
