Re: What are the options for obtaining IDF at interactive speeds?

Roman Chyla Mon, 08 Jul 2013 13:46:36 -0700

Hi,
I am curious about the function query: did you try it and find it didn't work,
or was it too slow?


idf(other_field,field(term))

Thanks!

  roman


On Mon, Jul 8, 2013 at 4:34 PM, Kathryn Mazaitis <ka...@rivard.org> wrote:

> Hi All,
>
> Resolution: I ended up cheating. :P Though now that I look at it, I think
> this was Roman's second suggestion. Thanks!
>
> Since the application that will be processing the IDF figures is located on
> the same machine as SOLR, I opened a second IndexReader on the lucene index
> and used
>
> reader.numDocs()
> reader.docFreq(field,term)
>
> to generate IDF by hand, ref: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
>
> As it turns out, using this method to get IDF on all the terms mentioned in
> the set of relevant documents runs in time comparable to retrieving the
> documents in the first place (so, .1-1s). This makes it fast enough that
> it's no longer the slowest part of my algorithm by far. Problem solved! It
> is possible that IDFValueSource would be faster; I may swap that in at a
> later date.
>
> I will keep Mikhail's debugQuery=true in my pocket, too; that technique
> would never have occurred to me. Thank you too!
>
> Best,
> Katie
>
>
> On Wed, Jul 3, 2013 at 11:35 PM, Roman Chyla <roman.ch...@gmail.com>
> wrote:
>
> > Hi Kathryn,
> > I wonder if you could index all your terms as separate documents and then
> > construct a new query (2nd pass)
> >
> > q=term:term1 OR term:term2 OR term:term3
> >
> > and use func to score them
> >
> > idf(other_field,field(term))
> >
> > the 'term' index cannot be multi-valued, obviously.
> >
> > Other than that, if you could do it on the server side, that would be the
> > fastest - the code is ready inside IDFValueSource:
> >
> >
> > http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html
> >
> > roman
> >
> >
> > On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis
> > <kathryn.riv...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I'm using SOLRJ to run a query, with the goal of obtaining:
> > >
> > > (1) the retrieved documents,
> > > (2) the TF of each term in each document,
> > > (3) the IDF of each term in the set of retrieved documents (TF/IDF would be
> > > fine too)
> > >
> > > ...all at interactive speeds, or <10s per query. This is a demo, so if all
> > > else fails I can adjust the corpus, but I'd rather, y'know, actually do it.
> > >
> > > (1) and (2) are working; I completed the patch posted in the following
> > > issue:
> > > https://issues.apache.org/jira/browse/SOLR-949
> > > and am just setting tv=true&tv.tf=true for my query. This way I get the
> > > documents and the tf information all in one go.
> > >
> > > With (3) I'm running into trouble. I have found 2 ways to do it so far:
> > >
> > > Option A: set tv.df=true or tv.tf_idf for my query, and get the idf
> > > information along with the documents and tf information. Since each term
> > > may appear in multiple documents, this means retrieving idf information for
> > > each term about 20 times, and takes over a minute to do.
> > >
> > > Option B: After I've gathered the tf information, run through the list of
> > > terms used across the set of retrieved documents, and for each term, run a
> > > query like:
> > > {!func}idf(text,'the_term')&deftype=func&fl=score&rows=1
> > > ...while this retrieves idf information only once for each term, the added
> > > latency for doing that many queries piles up to almost two minutes on my
> > > current corpus.
> > >
> > > Is there anything I didn't think of -- a way to construct a query to get
> > > idf information for a set of terms all in one go, outside the bounds of
> > > what terms happen to be in a document?
> > >
> > > Failing that, does anyone have a sense for how far I'd have to scale down a
> > > corpus to approach interactive speeds, if I want this sort of data?
> > >
> > > Katie
> > >
> >
>
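
For anyone following the "IDF by hand" resolution above, here is a minimal
sketch of the arithmetic only, assuming you already hold the two counts Katie
mentions (the total from reader.numDocs() and a per-term document frequency
from reader.docFreq). It does not open an index. The class name IdfSketch and
its method names are hypothetical and not part of Lucene or Solr. plainIdf is
the textbook ln(N/df) from the Wikipedia page linked in the thread;
smoothedIdf is the 1 + ln(N/(df+1)) form that, as far as I know, Lucene 4.x
DefaultSimilarity uses, so treat that correspondence as an assumption and
check it against your version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: pure arithmetic on counts, no Lucene dependency.
final class IdfSketch {
    private IdfSketch() {}

    // Textbook IDF: ln(numDocs / docFreq).
    // Returns 0 when either count is non-positive (term absent or empty index).
    static double plainIdf(int numDocs, int docFreq) {
        if (numDocs <= 0 || docFreq <= 0) {
            return 0.0;
        }
        return Math.log((double) numDocs / docFreq);
    }

    // Smoothed IDF: 1 + ln(numDocs / (docFreq + 1)).
    // Assumes numDocs >= 1; the +1 keeps the division safe when docFreq is 0.
    static double smoothedIdf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1.0));
    }

    // Computes smoothed IDF for every term in one pass over the supplied
    // term -> document-frequency map, preserving the map's iteration order.
    static Map<String, Double> idfForTerms(int numDocs, Map<String, Integer> docFreqs) {
        Map<String, Double> result = new LinkedHashMap<String, Double>();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            result.put(entry.getKey(), smoothedIdf(numDocs, entry.getValue()));
        }
        return result;
    }
}
```

In the setup described in the thread, numDocs would be read once per query and
the document-frequency map would be filled with one docFreq lookup per distinct
term in the retrieved documents, so each term is looked up only once rather
than once per document as in Option A.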
