Re: Significant Term aggregation

Mark Harwood Fri, 02 May 2014 07:07:44 -0700


> Your second concern that the query criteria are not identifying a result set 
> with any sense of cohesion might be true. Basically the search I am 
> executing is a filter. The document metadata either has the value or 
> not. Hence the result set may not be "cohesive". The reason for me to use 
> Significant Terms is so that the query can be enhanced to provide a 
> more cohesive set of documents. 
>


We can probably debug that from the results of the agg. For each 
"significant" term you should get a score and all the ingredients that went 
into it are also available:
1) The number of docs in the result set with the given term
2) The size of your result set
3) The number of docs in the index with the given term (see the "bg_count" 
value)
4) The size of the index 

In a "cohesive" set you should see a reasonable difference in the term 
probabilities, i.e. number 1 divided by number 2 (the term's frequency in 
the result set) vs number 3 divided by number 4 (its frequency in the index). 
If all you've selected in your query is effectively random docs with no 
common theme then the use of words in the background and foreground barely 
differs, and the two ratios are practically the same, giving a poor-scoring 
set of results.
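
To make that comparison concrete, here is a minimal Python sketch of the 
arithmetic. It is not the scoring formula the aggregation uses internally; it 
only divides the four ingredients listed above so the two probabilities can be 
compared side by side. The function names, parameter names and sample counts 
are all made up for illustration.

```python
def term_probabilities(fg_count, fg_size, bg_count, bg_size):
    """Return (foreground, background) probabilities for one term.

    fg_count: docs in the result set containing the term   (ingredient 1)
    fg_size:  size of the result set                       (ingredient 2)
    bg_count: docs in the index containing the term        (ingredient 3)
    bg_size:  size of the index                            (ingredient 4)
    """
    if fg_size <= 0 or bg_size <= 0:
        raise ValueError("result set and index sizes must be positive")
    return fg_count / fg_size, bg_count / bg_size


def lift(fg_count, fg_size, bg_count, bg_size):
    """Ratio of foreground to background probability (illustrative only)."""
    fg, bg = term_probabilities(fg_count, fg_size, bg_count, bg_size)
    if bg == 0:
        raise ValueError("term never appears in the background")
    return fg / bg


# Hypothetical counts: (term, fg_count, fg_size, bg_count, bg_size)
samples = [
    ("rare-topic-word", 50, 1000, 500, 1_000_000),  # much more common in results
    ("the", 900, 1000, 900_000, 1_000_000),         # equally common everywhere
]

for term, *counts in samples:
    fg, bg = term_probabilities(*counts)
    print(f"{term}: foreground={fg:.4f} background={bg:.4f} "
          f"ratio={lift(*counts):.1f}")
```

A ratio close to 1, as for "the" in the made-up numbers, means the term is no 
more common in the result set than in the index as a whole. If most of your 
suggested terms look like that, the result set is probably not cohesive.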

 

> On Thursday, 1 May 2014 10:04:15 UTC-5, Mark Harwood wrote:
>>
>> Thanks for the feedback, Ramdev.
>>
>>
>>> What I noticed in my aggregation results is a lot of stopwords (a, an, 
>>> the, at, and, etc.) being included as significant terms. 
>>>
>>
>> These sorts of terms shouldn't really need any sort of special treatment. 
>> If they are appearing as suggestions then I expect one of the following 
>> statements to be true:
>>
>> 1) You have a very small number of docs in the result set representing 
>> the "foreground" sample. Significant terms needs a reasonable number of 
>> docs in a sample to draw any real conclusions
>> 2) You have query criteria that are not identifying a result set with any 
>> sense of cohesion e.g. a query for random docs
>> 3) You have changed the set of stopwords in use in your index - what 
>> previously never used to appear at all is now suddenly common or 
>> vice-versa. 
>> 4) You are querying across mixed indices or doc-types (one with 
>> stop-words, one without) and we fail to tune-out the stopwords as part of 
>> the results merging process because one small index reports them back as 
>> commonplace while another large index has them as missing or rare. In the 
>> merged stats they therefore appear to be highly correlated with your query 
>> request.
>>
>> Please let me know if none of these scenarios explain your results.
>>
>>  
>>
>>> Another possible enhancement that would be nice is phrase significance 
>>> (multi-term significance instead of single-term significance). 
>>>
>>
>>
>> I outline some of the possibilities in creating phrases from significant 
>> terms, starting 51 mins into this recent video: 
>> https://skillsmatter.com/skillscasts/5175-revealing-the-uncommonly-common-with-elasticsearch
>>  
>>
>>>
>>> Cheers and Thanks for all the fish
>>>
>>
>> You're welcome and thanks again for the feedback
>> Mark 
>>
>

