Looks like this could be a very easy addition to TermsComponent? From what
I read in the code, it uses TermContext to compute/hold the stats, and the
latter already has docFreq and totalTermFreq (!!). It's just that
TermsComponent does not output TTF (only computes it...):

    for(int i=0; i<terms.length; i++) {
      if(termContexts[i] != null) {
        String outTerm =
            fieldType.indexedToReadable(terms[i].bytes().utf8ToString());
        int docFreq = termContexts[i].docFreq();
        termsMap.add(outTerm, docFreq);
      }
    }
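
To make the idea concrete, here is a minimal sketch of a loop that reports both statistics per term. It is not the actual TermsComponent code: `TermStatsSketch` is a hypothetical stand-in for TermContext that only holds the two numbers, and the "df"/"ttf" output keys are an assumption, not an existing response format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for Lucene's TermContext: it only carries the two
// statistics discussed here and is not the real class.
final class TermStatsSketch {
    final long docFreq;
    final long totalTermFreq;

    TermStatsSketch(long docFreq, long totalTermFreq) {
        this.docFreq = docFreq;
        this.totalTermFreq = totalTermFreq;
    }

    // Same shape as the loop above, but emits both stats per term. Entries
    // whose stats are null (term not found) are skipped, as in that loop.
    // The "df" and "ttf" keys are assumed names for illustration only.
    static Map<String, Map<String, Long>> collect(String[] terms, TermStatsSketch[] stats) {
        Map<String, Map<String, Long>> out = new LinkedHashMap<>();
        for (int i = 0; i < terms.length; i++) {
            if (stats[i] != null) {
                Map<String, Long> entry = new LinkedHashMap<>();
                entry.put("df", stats[i].docFreq);
                entry.put("ttf", stats[i].totalTermFreq);
                out.put(terms[i], entry);
            }
        }
        return out;
    }
}
```

In the real component the values would presumably come from termContexts[i].docFreq() and termContexts[i].totalTermFreq(), and how they are nested in the response would need to stay compatible with existing clients.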


On Wed, Feb 22, 2017 at 5:34 PM Joel Bernstein <joels...@gmail.com> wrote:

> Yeah, I think expanding the functionality of the terms component looks
> like the right place to add these stats.
>
> I plan on exposing these types of terms stats as Streaming Expression
> functions but I would likely use the terms component under the covers.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Feb 22, 2017 at 8:56 AM, Shai Erera <ser...@gmail.com> wrote:
>
> No, they are not global distributed stats. I am willing to live with
> approximated stats though (unless again, there's an API which can give me
> both). I wonder why the Terms component doesn't return ttf in addition to
> docfreq. The API (at the Lucene level) is right there already.
>
> On Wed, Feb 22, 2017 at 3:49 PM Joel Bernstein <joels...@gmail.com> wrote:
>
> Hi Shai,
>
> Do ttf and docfreq return global stats in distributed mode? I wasn't aware
> that there was a mechanism for aggregating values in the field list.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Feb 22, 2017 at 7:18 AM, Shai Erera <ser...@gmail.com> wrote:
>
> Hi
>
> I am currently using function queries to obtain these two statistics, as I
> didn't see a better or more explicit API and the Terms component only
> returns docFreq, but not totalTermFreq.
>
> I use the API by submitting requests like this:
>
> curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=1&fl=ttf(text,'t1'),docfreq(text,'t1')"
>
> Today I noticed that it sometimes returns 0 for these stats for existing
> terms. After debugging and going through the code, I noticed that it
> performs analysis on the value that's given. So if I provide an already
> stemmed value, it analyzes the value further and in some cases it results
> in a non-existing term (and in other cases I get stats for a term I didn't
> ask for).
>
> I want to get the stats of the indexed version of the terms, and that's
> why I send the already stemmed one. In my case I tried to get the stats for
> the term 'disguis' which is the stem of 'disguise' and 'disguised', however
> it further analyzed the value to 'disgui' (per the analysis chain) and that
> term does not exist in the index.
>
> So first question is -- is this the right API to retrieve such statistics?
> I didn't find another one, but could be I missed it.
>
> If it is, why does it analyze the value? I tried to wrap the value with
> single and double quotes, but of course that does not affect the analysis
> ... is analysis an intended behavior or a bug?
>
> Shai
>
>
>
>