Thanks Ameer!

I was thinking about a few ideas. One is to tap into the Codec extension
point to store multi-IDF values in 2 files, namely an IDF Meta-file & an
IDF Data-file

The IDF Meta-file holds a list of {UserId, Terms-Data-File-Offset} pairs for
each Term, encoded via ForUtil.

The IDF Data-file holds a "Count" followed by {doc_type, idf} pairs, encoded
as vInts. ["Count" is the number of vInt pairs to decode for a given UserId]

TermStats for each Term also needs to be extended to store the pair of start
offsets {IDF Meta-file, IDF Data-file}, as vLongs
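
To make the Data-file layout a bit more concrete, here is a rough,
self-contained Java sketch of how one record could be written and read
back. Everything in it is an assumption for illustration: the class name
IdfDataRecord, the fixed-point SCALE used to fit a float idf into a vInt,
and the in-memory byte array standing in for a real codec output. It
hand-rolls the vInt byte layout instead of going through Lucene's
DataOutput, so it runs without Lucene on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of one IDF Data-file record for a single UserId:
// Count, then Count x {doc_type, idf} pairs, all written as vInts.
final class IdfDataRecord {

    // ASSUMPTION: idf is stored as fixed-point (idf * SCALE) so that it
    // fits the vInt encoding; the real precision requirement is unknown.
    static final int SCALE = 1000;

    // vInt layout: 7 data bits per byte, low-order group first,
    // high bit set on every byte except the last one.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Reads one vInt starting at pos[0] and advances pos[0] past it.
    static int readVInt(byte[] buf, int[] pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    // Encodes Count followed by the {doc_type, idf} pairs.
    static byte[] encode(int[] docTypes, float[] idfs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, docTypes.length);
        for (int i = 0; i < docTypes.length; i++) {
            writeVInt(out, docTypes[i]);
            writeVInt(out, Math.round(idfs[i] * SCALE));
        }
        return out.toByteArray();
    }

    // Decodes the record that starts at the given offset into doc_type -> idf.
    static Map<Integer, Float> decode(byte[] buf, int offset) {
        int[] pos = {offset};
        int count = readVInt(buf, pos);
        Map<Integer, Float> idfByDocType = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            int docType = readVInt(buf, pos);
            float idf = readVInt(buf, pos) / (float) SCALE;
            idfByDocType.put(docType, idf);
        }
        return idfByDocType;
    }
}
```

For example, doc_types {1, 7} with idf values {2.5, 0.125} encode to a
6-byte record under this scheme. The offset passed to decode() plays the
role of the per-user offset kept in the IDF Meta-file.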

There's a possibility of a long tail occurring in the IDF Meta-file. That
is, the number of users sharing a term (possibly a common term) could be
very high, in which case we might need to generate sampling data. But this
currently doesn't happen in our app

This is just a quick hack & I really don't have an estimate of the penalty
we would have to pay for fetching this info

Not sure if this is a worthwhile idea to explore. Any input from members is
much appreciated

--
Ravi

On Tue, Dec 3, 2019 at 10:30 PM Ameer Albahem <ameer.alba...@gmail.com>
wrote:

> IDF is a simple measure to calculate. So, if building a separate index for
> each user is not an ideal solution, then I suggest you could try to
> calculate these statistics upfront. Just maintain these statistics for each
> user, then use them in the query process.
>
> At search time, you use these stats in your ranking. One possible way
> is to write a similarity wrapper that will read the needed information from
> a hash map.
>
> Regards
> Ameer
>
>
>
> On Wed, 4 Dec 2019 at 00:55, Ravikumar Govindarajan <
> ravikumar.govindara...@gmail.com> wrote:
>
> > >
> > > it is enough to give each its own field.
> > >
> >
> > I kind of over-simplified the problem at hand. Apologies.
> >
> > DOC_TYPE is just one aspect of the problem. The other one is that it is
> > actually a shared index with multiple users (100-3000 users per
> > index). There are many hundreds of such shared indexes in our cluster
> >
> > Search happens per-user & it doesn't make sense to have a single IDF. We
> > are ideally looking at some lucene extensions/tricks to store & retrieve
> > IDF in <User/DOC_TYPE> pairs.
> >
> > Is there any reason why you are not storing each DOC_TYPE in its own
> > index?
> >
> >
> > There are some common-fields across all DOC_TYPES (Ex: content/attachment
> > et al..)  & to provide unified-search for a user, we colocate them in a
> > single index
> >
> > --
> > Ravi
> >
> > On Tue, Dec 3, 2019 at 6:30 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
> > dceccarel...@bloomberg.net> wrote:
> >
> > > Hi Ravi,
> > > Can you give more details on how you store an entity into lucene? what
> > > is a doc type?
> > > what fields do you have?
> > >
> > > Cheers
> > >
> > > From: java-user@lucene.apache.org At: 12/03/19 12:50:40 To:
> > > java-user@lucene.apache.org
> > > Subject: Multi-IDF for a single term possible?
> > >
> > > Hello,
> > >
> > > We are using TF-IDF for scoring (Yet to migrate to BM25). Different
> > > entities (DOC_TYPES) are crunched & stored together in a single index.
> > >
> > > When it comes to IDF, I find that there is a single value computed
> > > across documents & stored as part of TermStats, whereas our documents
> > > are not homogeneous. So, a single IDF value doesn't work for us
> > >
> > > We would like to compute IDF for each <Term/DOC_TYPE> pair, store it &
> > > later use the paired-IDF values during query time. Is something like
> > > this possible via Codecs or other mechanisms?
> > >
> > > Any help is much appreciated
> > >
> > > --
> > > Ravi
> > >
> > >
> > >
> >
>
