Hi,

I am curious about the functional query: did you try it and find it didn't
work, or was it too slow?

idf(other_field,field(term))

Thanks!

roman

On Mon, Jul 8, 2013 at 4:34 PM, Kathryn Mazaitis <ka...@rivard.org> wrote:

> Hi All,
>
> Resolution: I ended up cheating. :P Though now that I look at it, I think
> this was Roman's second suggestion. Thanks!
>
> Since the application that will be processing the IDF figures is located on
> the same machine as SOLR, I opened a second IndexReader on the lucene index
> and used
>
> reader.numDocs()
> reader.docFreq(field,term)
>
> to generate IDF by hand, ref: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
>
> As it turns out, using this method to get IDF on all the terms mentioned in
> the set of relevant documents runs in time comparable to retrieving the
> documents in the first place (so, .1-1s). This makes it fast enough that
> it's no longer the slowest part of my algorithm by far. Problem solved! It
> is possible that IDFValueSource would be faster; I may swap that in at a
> later date.
>
> I will keep Mikhail's debugQuery=true in my pocket, too; that technique
> would never have occurred to me. Thank you too!
>
> Best,
> Katie
>
>
> On Wed, Jul 3, 2013 at 11:35 PM, Roman Chyla <roman.ch...@gmail.com>
> wrote:
>
> > Hi Kathryn,
> > I wonder if you could index all your terms as separate documents and then
> > construct a new query (2nd pass)
> >
> > q=term:term1 OR term:term2 OR term:term3
> >
> > and use func to score them
> >
> > idf(other_field,field(term))
> >
> > the 'term' index cannot be multi-valued, obviously.
> >
> > Other than that, if you could do it on server side, that would be the
> > fastest - the code is ready inside IDFValueSource:
> >
> > http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html
> >
> > roman
> >
> >
> > On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis
> > <kathryn.riv...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I'm using SOLRJ to run a query, with the goal of obtaining:
> > >
> > > (1) the retrieved documents,
> > > (2) the TF of each term in each document,
> > > (3) the IDF of each term in the set of retrieved documents (TF/IDF would
> > > be fine too)
> > >
> > > ...all at interactive speeds, or <10s per query. This is a demo, so if all
> > > else fails I can adjust the corpus, but I'd rather, y'know, actually do it.
> > >
> > > (1) and (2) are working; I completed the patch posted in the following
> > > issue:
> > > https://issues.apache.org/jira/browse/SOLR-949
> > > and am just setting tv=true&tv.tf=true for my query. This way I get the
> > > documents and the tf information all in one go.
> > >
> > > With (3) I'm running into trouble. I have found 2 ways to do it so far:
> > >
> > > Option A: set tv.df=true or tv.tf_idf for my query, and get the idf
> > > information along with the documents and tf information. Since each term
> > > may appear in multiple documents, this means retrieving idf information for
> > > each term about 20 times, and takes over a minute to do.
> > >
> > > Option B: After I've gathered the tf information, run through the list of
> > > terms used across the set of retrieved documents, and for each term, run a
> > > query like:
> > > {!func}idf(text,'the_term')&deftype=func&fl=score&rows=1
> > > ...while this retrieves idf information only once for each term, the added
> > > latency for doing that many queries piles up to almost two minutes on my
> > > current corpus.
> > >
> > > Is there anything I didn't think of -- a way to construct a query to get
> > > idf information for a set of terms all in one go, outside the bounds of
> > > what terms happen to be in a document?
> > >
> > > Failing that, does anyone have a sense for how far I'd have to scale down a
> > > corpus to approach interactive speeds, if I want this sort of data?
> > >
> > > Katie
> > >
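
For reference, the by-hand calculation described in the resolution at the top of this thread can be sketched roughly as follows. This is only a minimal sketch, assuming the textbook formula idf = ln(numDocs / docFreq) from the Wikipedia page linked there; the thread does not say which variant was actually used, and Solr's own idf() function may apply different smoothing, so the values need not match. The class name IdfByHand and its methods are hypothetical. The two inputs stand in for the values returned by reader.numDocs() and reader.docFreq(...); they are passed as plain numbers here so the sketch runs without Lucene on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: computes IDF from raw index statistics, no Lucene dependency. */
final class IdfByHand {

    private IdfByHand() {}

    /**
     * Textbook IDF: ln(numDocs / docFreq).
     * Assumption: terms with no postings (docFreq <= 0) and an empty index get 0.0
     * instead of an infinite or undefined value.
     */
    static double idf(long docFreq, long numDocs) {
        if (docFreq <= 0 || numDocs <= 0) {
            return 0.0;
        }
        return Math.log((double) numDocs / (double) docFreq);
    }

    /**
     * Computes IDF once per distinct term. In a real setup the docFreqs map would be
     * filled from the index reader, one lookup per term found in the retrieved documents.
     */
    static Map<String, Double> idfForTerms(Map<String, Long> docFreqs, long numDocs) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : docFreqs.entrySet()) {
            result.put(e.getKey(), idf(e.getValue(), numDocs));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        Map<String, Long> docFreqs = new LinkedHashMap<>();
        docFreqs.put("solr", 10L);
        docFreqs.put("the", 1000L);
        long numDocs = 1000L;

        idfForTerms(docFreqs, numDocs)
                .forEach((term, value) -> System.out.println(term + "\t" + value));
    }
}
```

Because each term is looked up once regardless of how many retrieved documents contain it, this avoids both the repeated per-document IDF of Option A and the per-term query round trips of Option B.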