Re: Significant Term aggregation

Ramdev Wudali Fri, 02 May 2014 06:32:42 -0700

Hi Mark:
   Thanks for the update. 
The corpus I am searching against is a news feed corpus, and the number of 
documents is not that small (some queries return over 400K docs in the 
result set). Since these are news articles, the documents are not short, 
Twitter-like sentences. Most of my query results contain at least tens 
of thousands of documents, if not more.


Your second concern, that the query criteria are not identifying a result 
set with any sense of cohesion, might be true. Basically, the search I am 
executing is a filter: the document metadata either has the value or it 
does not. Hence the result set may not be "cohesive". My reason for using 
significant terms is so that the query can be enhanced to provide a 
more cohesive set of documents. 
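
Roughly, the request has the shape sketched below. This is only an illustration: the field names ("source", "body") and the filter value are placeholders rather than my real mapping, and it assumes the 1.x-style `filtered` query (newer versions would express the same thing with a `bool` query and a `filter` clause).

```python
import json


def build_request(meta_field, meta_value, text_field, size=20):
    """Build a search body: a metadata filter plus a significant_terms aggregation.

    All field names passed in are hypothetical placeholders.
    """
    return {
        "size": 0,  # only the aggregation is of interest, not the hits
        "query": {
            "filtered": {  # ES 1.x syntax; later versions use bool + filter
                "filter": {"term": {meta_field: meta_value}}
            }
        },
        "aggs": {
            "keywords": {
                # terms unusually frequent in the filtered set vs. the whole index
                "significant_terms": {"field": text_field, "size": size}
            }
        },
    }


# Placeholder values, not taken from the real index
request_body = build_request("source", "reuters", "body")
print(json.dumps(request_body, indent=2))
```

The terms that come back could then be added to the query as extra criteria, which is the "enhancement" I have in mind.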

I am using the standard stop word list that comes with ES and have not 
added to or removed from it.  
I am also not querying across multiple indices/types (there is only 
one index, with one type within it). 
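
For what it is worth, this is how I understand the mixed-index scenario in point 4 of your message, worked through with made-up numbers. The `lift` function is a deliberately simplified ratio of foreground to background frequency that I am using for illustration; it is not the scoring heuristic ES actually applies, and all the counts are invented.

```python
def lift(fg_count, fg_size, bg_count, bg_size):
    """Ratio of a term's foreground frequency to its background frequency.

    A simplified stand-in for a significance score, not the ES heuristic.
    """
    return (fg_count / fg_size) / (bg_count / bg_size)


# One index that keeps stopwords: "the" is in 95% of matching docs and in
# 95% of all docs, so it does not stand out.
single = lift(fg_count=475, fg_size=500, bg_count=950, bg_size=1_000)

# Merged stats from that small index plus a large index (1,000,000 docs)
# where stopwords were removed, so "the" never appears there. The query
# matches 500 docs in each index.
merged = lift(
    fg_count=475 + 0,
    fg_size=500 + 500,
    bg_count=950 + 0,
    bg_size=1_000 + 1_000_000,
)

print(f"single index lift: {single:.1f}")
print(f"merged lift:       {merged:.1f}")
```

In the merged case "the" looks strongly correlated with the query simply because the large index contributes background docs but no occurrences. That is not my setup, since I have a single index and type.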

I will watch the video and see if I can get some ideas to improve my 
queries. 

All in all, I find the new aggregations feature quite helpful (at least for 
generating some descriptive analytics).

Cheers

Ramdev







On Thursday, 1 May 2014 10:04:15 UTC-5, Mark Harwood wrote:
>
> Thanks for the feedback, Ramdev.
>
>
>> What I noticed in my aggregation results is a lot of stopwords (a, an, 
>> the, at, and, etc.) being included as significant terms. 
>>
>
> These sorts of terms shouldn't really need any sort of special treatment. 
> If they are appearing as suggestions then I expect one of the following 
> statements to be true:
>
> 1) You have a very small number of docs in the result set representing the 
> "foreground" sample. Significant terms needs a reasonable number of docs in 
> a sample to draw any real conclusions
> 2) You have query criteria that is not identifying a result set with any 
> sense of cohesion e.g. a query for random docs
> 3) You have changed the set of stopwords in use in your index - what 
> previously never used to appear at all is now suddenly common or 
> vice-versa. 
> 4) You are querying across mixed indices or doc-types (one with 
> stop-words, one without) and we fail to tune-out the stopwords as part of 
> the results merging process because one small index reports them back as 
> commonplace while another large index has them as missing or rare. In the 
> merged stats they therefore appear to be highly correlated with your query 
> request.
>
> Please let me know if none of these scenarios explain your results.
>
>  
>
>> Another possible enhancement would be phrase significance (multi-term 
>> significance instead of single-term significance). 
>>
>
>
> I outline some of the possibilities in creating phrases from significant 
> terms, starting 51 mins into this recent video: 
> https://skillsmatter.com/skillscasts/5175-revealing-the-uncommonly-common-with-elasticsearch
>  
>
>>
>> Cheers and Thanks for all the fish
>>
>
> You're welcome and thanks again for the feedback
> Mark 
>
