Re: Significant Term aggregation

Mark Harwood Fri, 02 May 2014 15:38:19 -0700

So there are potentially several things going on here:

1) Your query may be too broad. Depending on how your analysis is set up, 
you are likely querying for [fuel] OR [cell] OR [battery] as independent 
words, meaning you'll match a lot of docs, e.g. those mentioning only "fuel 
prices". This reduces the "cohesion" of the topics covered in the 
result set. Consider using ANDs or phrases in free-text queries, or use 
untokenized category fields to tighten up the result set.
2) Some of your docs look to cover many diverse topics in one doc e.g. this 
one mentions fuel, facebook and a 
drugstore: http://www.marketintelligencecenter.com/articles/469889 Can 
these multi-story pages be filtered out somehow?
3) Do your bodies have standard "boilerplate" text common to many pages? 
e.g. the author's biography as shown here: 
http://www.marketintelligencecenter.com/articles/498395 If so, then the 
repetition of a common passage may make certain words undesirably highly 
correlated with a topic: the author who covers that industry sector 
likely has his biography on every related page, so words from his biography 
(e.g. the name of a university) will be skewed towards that industry sector.
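
As a rough sketch of point 1, the request bodies below contrast a broad 
query with two tighter ones. The field name "body" and the aggregation name 
"keywords" are assumptions made for illustration; substitute whatever your 
mapping uses. Only the JSON bodies are built here, nothing is sent to a 
cluster.

```python
import json

# Hypothetical field name; substitute the analyzed text field from your mapping.
TEXT_FIELD = "body"


def significant_terms_request(query_clause, field=TEXT_FIELD, size=10):
    """Build a search body that runs significant_terms over the hits of query_clause."""
    return {
        "query": query_clause,
        "size": 0,  # skip the hits; only the aggregation buckets matter here
        "aggs": {
            # "keywords" is an arbitrary aggregation name
            "keywords": {"significant_terms": {"field": field, "size": size}}
        },
    }


# Broad: a match query ORs the analyzed terms by default,
# so docs mentioning only "fuel prices" also match.
broad = {"match": {TEXT_FIELD: "fuel cell battery"}}

# Tighter: every term must be present in the document.
all_terms = {
    "match": {TEXT_FIELD: {"query": "fuel cell battery", "operator": "and"}}
}

# Tightest: the words must appear together as a phrase.
phrase = {"match_phrase": {TEXT_FIELD: "fuel cell"}}

if __name__ == "__main__":
    print(json.dumps(significant_terms_request(phrase), indent=2))
```

Running the same aggregation against each of the three queries and comparing 
the buckets is one way to see how much the breadth of the result set affects 
the terms that come back.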


So reasonably clean, on-topic data is required to derive anything sensible 
using this statistical approach.
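
To make the boilerplate effect concrete, here is a toy calculation with 
made-up counts. The score used is a plain ratio of a term's share of the 
result set to its share of the whole index; that is a deliberate 
simplification and should not be read as the exact formula Elasticsearch 
uses. The terms and numbers are hypothetical.

```python
def lift(fg_hits, fg_total, bg_hits, bg_total):
    """Ratio of a term's share of the result set to its share of the whole index.

    A simplified stand-in for a significance score, used only for illustration.
    """
    # Equivalent to (fg_hits / fg_total) / (bg_hits / bg_total), written as a
    # single division of integer products to avoid rounding noise.
    return (fg_hits * bg_total) / (fg_total * bg_hits)


# Made-up corpus: 10,000 docs in the index, 100 docs in the result set.
INDEX_DOCS = 10_000
RESULT_DOCS = 100

# term -> (docs in the result set containing it, docs in the index containing it)
counts = {
    "the": (100, 9_900),       # stop word: common everywhere
    "electrolyte": (40, 60),   # genuinely topical word
    "university": (90, 120),   # appears only in the author's repeated biography
}

scores = {
    term: lift(fg, RESULT_DOCS, bg, INDEX_DOCS) for term, (fg, bg) in counts.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    for term in ranked:
        print(f"{term:12s} {scores[term]:8.2f}")
```

With these counts the stop word scores close to 1, because it is about as 
common in the result set as in the index, while the biography word 
outranks the topical one purely because the same passage is repeated on 
most pages in that sector.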



On Wednesday, April 30, 2014 7:54:17 PM UTC+1, Ramdev Wudali wrote:
>
> Hi:
>    I have been trying out (successfully) the Significant Terms 
> aggregation in release 1.1.0. The blog post about this feature
> (http://www.elasticsearch.org/blog/significant-terms-aggregation/) was 
> extremely helpful. Since this feature is in an experimental stage and the 
> authors had requested feedback, and since I do not know how to provide 
> feedback regarding specific features, I am resorting to posting on this 
> group.
>
> I had posted on a different thread regarding accessing the TF-IDF scores 
> for terms so that I could investigate ways in which I could enhance my 
> queries. This led me to look at the experimental Significant Terms 
> Aggregation. It does what it says quite well, and I am glad this 
> functionality exists. However, I would like to suggest some possible 
> enhancements:
>
> What I noticed in my aggregation results is a lot of stop words (a, an, 
> the, at, and, etc.) being included as significant terms. Perhaps stop word 
> lists could be supported so that these stop words are not 
> included in the significant term calculations. (The significance is 
> calculated based on how many times a term appears in the query result vs 
> how many times it appears in the whole index.) For common stop words this 
> calculation is going to make them very significant. 
>
> Another possible enhancement would be phrase significance (a multi-term 
> significance instead of a single-term one). 
>
> In the blog post, a similar effect is obtained by highlighting the terms 
> that are identified as significant. But it would be nice to be able to just 
> look at the buckets and determine that.
>
>
> Cheers and Thanks for all the fish
>
>
> Ramdev
>
>

