> it is enough to give each its own field.

I kind of over-simplified the problem at hand. Apologies.

DOC_TYPE is just one aspect of the problem. The other is that this is
actually a shared index with multiple users (100-3000 users per index),
and there are many hundreds of such shared indexes in our cluster.
Search happens per user, so it doesn't make sense to have a single IDF.
Ideally, we are looking for Lucene extensions or tricks to store and
retrieve IDF per <User/DOC_TYPE> pair.

> Is there any reason why you are not storing each DOC_TYPE in its own index?

There are some common fields across all DOC_TYPES (e.g. content,
attachment, etc.), and to provide unified search for a user, we colocate
them in a single index.

--
Ravi

On Tue, Dec 3, 2019 at 6:30 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <[email protected]> wrote:

> Hi Ravi,
> Can you give more details on how you store an entity into lucene? what is
> a doc type?
> what fields do you have?
>
> Cheers
>
> From: [email protected] At: 12/03/19 12:50:40 To:
> [email protected]
> Subject: Multi-IDF for a single term possible?
>
> Hello,
>
> We are using TF-IDF for scoring (Yet to migrate to BM25). Different
> entities (DOC_TYPES) are crunched & stored together in a single index.
>
> When it comes to IDF, I find that there is a single value computed across
> documents & stored as part of TermStats, whereas our documents are not
> homogeneous. So, a single IDF value doesn't work for us.
>
> We would like to compute IDF for each <Term/DOC_TYPE> pair, store it &
> later use the paired-IDF values during query time. Is something like this
> possible via Codecs or other mechanisms?
>
> Any help is much appreciated
>
> --
> Ravi
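
To make the request in this thread concrete, here is a minimal sketch of what a paired IDF means arithmetically. It is plain Python, not Lucene code, and it does not answer whether this can be done through Codecs. The function name `paired_idf` and the toy documents are hypothetical. The formula is assumed to be the classic TF-IDF one, idf = 1 + ln((docCount + 1) / (docFreq + 1)), with docCount and docFreq counted inside each <User/DOC_TYPE> group instead of across the whole index.

```python
import math
from collections import defaultdict


def paired_idf(docs):
    """Compute IDF per (user, doc_type, term) instead of per term.

    docs: iterable of (user, doc_type, terms) tuples (hypothetical layout).
    Returns a dict mapping (user, doc_type, term) to its IDF value.
    """
    doc_count = defaultdict(int)  # number of documents per (user, doc_type)
    doc_freq = defaultdict(int)   # documents containing the term, per group
    for user, doc_type, terms in docs:
        group = (user, doc_type)
        doc_count[group] += 1
        for term in set(terms):   # count a term once per document
            doc_freq[group + (term,)] += 1
    # Classic TF-IDF formula, applied with group-local statistics.
    return {
        key: 1.0 + math.log((doc_count[key[:2]] + 1) / (df + 1))
        for key, df in doc_freq.items()
    }


# Toy shared index: two users, two DOC_TYPES (illustrative data only).
docs = [
    ("alice", "mail", ["invoice", "report"]),
    ("alice", "mail", ["report"]),
    ("alice", "note", ["invoice"]),
    ("bob", "mail", ["invoice"]),
]
idf = paired_idf(docs)
```

In this toy data, "invoice" is rarer than "report" within alice's mail, so it gets a higher IDF there, while a single index-wide IDF would mix in bob's documents and alice's notes.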
