Too many unique terms

Manuel Le Normand Tue, 23 Apr 2013 12:53:49 -0700

Hi there,
Looking at one of my shards (about 1M docs), I see a lot of unique terms: more
than 8M, which is a significant part of my total term count. These are very
likely useless terms, binaries or other meaningless numbers that come with
a few of my docs.
I am totally fine with deleting them, so that these terms become unsearchable.
Thinking about it, I realize that:
1. It is impossible to know a priori whether a term is unique, so I
cannot add them to my stop words.
2. My performance suffers because my cached chunks contain useless
data, and I'm short on memory.
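
To illustrate point 1, here is a minimal sketch in Python, assuming "unique" means a term that occurs in only one document. The function names and the toy input are hypothetical, and this is not how Lucene stores or counts terms. It only shows why uniqueness is known after all documents have been seen, not at analysis time:

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        # set() so a term repeated inside one document is counted once
        df.update(set(tokens))
    return df


def singleton_terms(docs):
    """Return the terms that appear in exactly one document."""
    df = document_frequencies(docs)
    return {term for term, count in df.items() if count == 1}


if __name__ == "__main__":
    # Toy corpus: each inner list is the token stream of one document.
    corpus = [
        ["solr", "a1b2", "a1b2"],
        ["solr", "lucene"],
        ["lucene", "zz9"],
    ]
    print(sorted(singleton_terms(corpus)))
```

A term can only be classified once the whole corpus has been counted, which is why a stop-word list built in advance cannot capture these terms.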


Assuming a constant index, is there a way of deleting all unique terms
from at least the term dictionary (.tim and .tip) files? Would I get a
significant improvement in query-time performance? Does anybody know a class of
regexes that identify meaningless terms, which I could add to my updateProcessor?
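
A minimal sketch of the kind of regex heuristic meant here, written in Python for brevity. The patterns, the length threshold, and the function names are illustrative assumptions, not a validated rule set and not part of Solr; they would need tuning against real data before being ported into an update processor:

```python
import re

# Assumed cut-off: terms longer than this are treated as noise.
MAX_TERM_LENGTH = 40

# Each pattern is a guess at a common shape of "meaningless" tokens.
NOISE_PATTERNS = [
    # Long hexadecimal strings (hashes, binary dumps).
    re.compile(r"^[0-9a-f]{16,}$", re.IGNORECASE),
    # Long runs of digits (ids, timestamps, counters).
    re.compile(r"^\d{9,}$"),
    # Long base64-like tokens that mix letters and digits.
    re.compile(r"^(?=.*\d)(?=.*[A-Za-z])[A-Za-z0-9+/=_-]{20,}$"),
]


def looks_meaningless(term):
    """Return True if the term matches one of the assumed noise shapes."""
    if len(term) > MAX_TERM_LENGTH:
        return True
    return any(pattern.match(term) for pattern in NOISE_PATTERNS)


def filter_terms(terms):
    """Keep only the terms that do not look like noise."""
    return [term for term in terms if not looks_meaningless(term)]


if __name__ == "__main__":
    sample = ["lucene", "deadbeefdeadbeef01", "1234567890", "2013"]
    print(filter_terms(sample))
```

Heuristics like these will produce false positives (for example, legitimate long part numbers), so it is worth measuring how many terms they remove on a sample before applying them at index time.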

Thanks
Manu
