Re: Most common adjacent words

2015-02-23 Thread jari
Thanks for the suggestion. 

I tried your second idea but it seems like running a terms aggregation on 
my shingles text field is a bit too much for ES. 
Even if it did work, it wouldn't have given me any data on adjacency / 
proximity.

[FIELDDATA] Data too large, data for [text.shingles] would be larger than 
limit of [623326003/594.4mb]]

{
  "size": 0,
  "aggregations": {
    "myAggregation": {
      "filter": {
        "query": {
          "query_string": {
            "default_field": "text",
            "query": "foo"
          }
        }
      },
      "aggregations": {
        "combos": {
          "terms": { "field": "text.shingles" }
        }
      }
    }
  }
}
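
Had the aggregation above returned, its buckets would have covered every shingle in the matching documents, not only those next to the query term. A minimal client-side sketch of narrowing them down, assuming two-word shingles joined by a single space. The bucket values and the helper name shingles_containing are hypothetical, not taken from a real response:

```python
def shingles_containing(buckets, term):
    """Keep terms-aggregation buckets whose shingle key contains `term` as a whole word."""
    term = term.lower()
    return [
        bucket for bucket in buckets
        if term in bucket["key"].lower().split()
    ]


# Hypothetical buckets, shaped like the "combos" terms aggregation response.
buckets = [
    {"key": "foo bar", "doc_count": 7},
    {"key": "bar baz", "doc_count": 5},
    {"key": "big foo", "doc_count": 3},
    {"key": "food bar", "doc_count": 2},
]

for bucket in shingles_containing(buckets, "foo"):
    print(bucket["key"], bucket["doc_count"])
```

Note that doc_count counts documents, not occurrences, so a shingle repeated within one document would still count once.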


On Monday, February 23, 2015 at 5:01:59 AM UTC+1, David Pilato wrote:
>
> I don't see a way to do exactly what you are looking for.
> But with a little effort on the client, you could try the highlighting
> feature, which could give something similar.
>
> Or maybe an aggregation with a filter for the term as the first level,
> then a terms agg on the field with a shingle analyzer, might give some
> results.
>
>
> HTH.
>
> --
> David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
> On 23 Feb 2015 at 00:42, ja...@holderdeord.no wrote:
>
> Hello,
>
> Does elasticsearch have the ability to return the most common *adjacent* 
> words for a given search query?
>
> That is, given some documents:
>
> {"text": "To be or not to be, that is the question"}
> {"text": "We know what we are, but know not what we may be"}
> {"text": "If music be the food of love, play on."}
>
> If I search for "be", I'd like to get back something like this:
>
> { "to be": 2 }
> { "may be": 1 }
> { "be or": 1 }
> { "music be": 1 }
> { "be the": 1 }
>
> I was looking at the phrase suggester, but couldn't make it do this (it 
> seems adamant about correcting the input text).
> If there's no way to do this currently, would it be feasible to write a 
> plugin to do so?
>
> Any advice is much appreciated.
>
> Jari
>
> -- 
> You received this message because you are subscribed to the Google Groups 
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to elasticsearc...@googlegroups.com .
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/elasticsearch/1f4a0ffa-b78c-426c-bc9a-76b068833544%40googlegroups.com
>  
> <https://groups.google.com/d/msgid/elasticsearch/1f4a0ffa-b78c-426c-bc9a-76b068833544%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/c8c42fde-3b85-44a9-8802-b4c6e4fc0545%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Most common adjacent words

2015-02-22 Thread jari
Hello,

Does elasticsearch have the ability to return the most common *adjacent* 
words for a given search query?

That is, given some documents:

{"text": "To be or not to be, that is the question"}
{"text": "We know what we are, but know not what we may be"}
{"text": "If music be the food of love, play on."}

If I search for "be", I'd like to get back something like this:

{ "to be": 2 }
{ "may be": 1 }
{ "be or": 1 }
{ "music be": 1 }
{ "be the": 1 }

I was looking at the phrase suggester, but couldn't make it do this (it 
seems adamant about correcting the input text).
If there's no way to do this currently, would it be feasible to write a 
plugin to do so?

Any advice is much appreciated.

Jari
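
For reference, the counts asked for above can be computed on the client from the matching documents. This is a minimal sketch, not an Elasticsearch feature: the helper name adjacent_pairs is hypothetical, and it assumes that punctuation breaks adjacency (so "be, that" is not counted as a pair):

```python
import re
from collections import Counter


def adjacent_pairs(texts, term):
    """Count two-word phrases that contain `term`, across all texts."""
    term = term.lower()
    counts = Counter()
    for text in texts:
        # Assumption: punctuation breaks adjacency, so split on it first.
        for segment in re.split(r"[^\w\s']+", text.lower()):
            words = segment.split()
            for left, right in zip(words, words[1:]):
                if term in (left, right):
                    counts[f"{left} {right}"] += 1
    return counts


docs = [
    "To be or not to be, that is the question",
    "We know what we are, but know not what we may be",
    "If music be the food of love, play on.",
]

for phrase, count in adjacent_pairs(docs, "be").most_common():
    print(phrase, count)
```

This only scales to result sets small enough to fetch in full, which is why a server-side answer would still be preferable.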



Re: Combining two aggregations to get term percentage

2015-02-17 Thread Jari Bakken
Yes!

If I have to do the division on my own I might as well stick with the two
aggregations, AFAICT.

But if it was available as a scoring heuristic I could effectively use {size:
N} so I don’t have to fetch the full set of countries to do this
calculation.
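
The client-side division discussed in this thread can be sketched as follows. Only the doc_count and bg_count field names come from the thread; the bucket values are made up for illustration and top_by_ratio is a hypothetical helper. It needs every bucket fetched before ranking, which is the cost mentioned above:

```python
def top_by_ratio(buckets, n):
    """Rank significant_terms buckets by doc_count / bg_count and keep the top n."""
    ranked = sorted(
        buckets,
        key=lambda b: b["doc_count"] / b["bg_count"],
        reverse=True,
    )
    return [
        {"key": b["key"], "percent": 100 * b["doc_count"] / b["bg_count"]}
        for b in ranked[:n]
    ]


# Hypothetical buckets; real ones would come from the aggregation response.
buckets = [
    {"key": "USA", "doc_count": 10, "bg_count": 100},
    {"key": "UK", "doc_count": 25, "bg_count": 50},
    {"key": "Monaco", "doc_count": 1, "bg_count": 1},
]

print(top_by_ratio(buckets, 2))
```

With these made-up numbers the single-document country ranks first, which is the skew towards rare terms that a plain percentage can produce.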

I’ve opened a feature request here
<https://github.com/elasticsearch/elasticsearch/issues/9720>.


On Tue, Feb 17, 2015 at 10:52 AM, Mark Harwood <
mark.harw...@elasticsearch.com> wrote:

> You can choose to ignore the score and compute your own by dividing
> doc_count by bg_count.
>
> Your post has made me think we should add this more easily explainable
> metric as one of the scoring heuristics we offer for this aggregation.
>
> On Tuesday, February 17, 2015 at 10:44:12 AM UTC, Jari Bakken wrote:
>>
>> Thanks Mark!
>>
>> I've been planning to look into `significant_terms`, but didn't know it
>> could help me with this. I'm a bit concerned that a too clever scoring
>> could be hard to explain to users, but I'll give it a shot.
>>
>>
>> On Tue, Feb 17, 2015 at 9:41 AM, Mark Harwood <mark.harw...@elasticsearch.com> wrote:
>>
>>> Nice to see someone taking the trouble to put their stats in context.
>>> Drives me nuts every time I see the equivalent of this:
>>> http://xkcd.com/1138/
>>>
>>> So we have a feature that does some of what you are after - it's called
>>> the "significant_terms" aggregation.
>>> Your query would look like this:
>>> {
>>>   "query": {
>>>     "match": {
>>>       "text": "foo"
>>>     }
>>>   },
>>>   "aggs": {
>>>     "keywords": {
>>>       "significant_terms": {
>>>         "field": "country",
>>>         "size": 100
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>> What you get back are buckets for each country with a doc_count that
>>> represents how many "foo" documents there were in that country and a
>>> background count called "bg_count" which is how many docs (foo and non foo)
>>> came from that country. Selections are ranked using a score that is
>>> returned and which is more nuanced than a straight doc_count/bg_count
>>> percentage. In practice we find prioritizing selections solely by a
>>> percentage measure can skew results towards very rare terms (in your case,
>>> very small countries) that have few data samples and so can more easily achieve
>>> high-scoring percentages. Instead, we offer a variety of scoring heuristics
>>> which place a different emphasis on popular vs rare when it comes to
>>> ranking: (see https://twitter.com/elasticmark/status/513320986956292096
>>> )
>>>
>>> Cheers
>>> Mark
>>>
>>> On Tuesday, February 17, 2015 at 1:07:31 AM UTC, ja...@holderdeord.no
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm looking for a way to have Elasticsearch calculate the percentage of
>>>> docs that match a query *within* a terms aggregation.
>>>> That is, given two aggregations where one is filtered and the other is
>>>> not:
>>>>
>>>> {
>>>>   "aggregations": {
>>>>     "countries": {
>>>>       "filter": {
>>>>         "query": {
>>>>           "query_string": {
>>>>             "default_field": "description",
>>>>             "query": "foo"
>>>>           }
>>>>         }
>>>>       },
>>>>       "aggregations": {
>>>>         "filteredCountries": {
>>>>           "terms": { "field": "country" }
>>>>         }
>>>>       }
>>>>     },
>>>>     "totalCountries": {
>>>>       "terms": { "field": "country" }
>>>>     }
>>>>   },
>>>>   "size": 0
>>>> }
>>>>
>>>> Let's say the totalCountries buckets are:
>>>>
>>>> "buckets": [
>>>>   {
>>>>     "key": "USA",
>>>>     "doc_count": 100
>>>>   },
>>>>   {
>>>>     "key": "UK",
>>>>     "doc_count": 50
>>>>   }
>>>> ]
>>>>

Re: Combining two aggregations to get term percentage

2015-02-17 Thread Jari Bakken
Thanks Mark!

I've been planning to look into `significant_terms`, but didn't know it
could help me with this. I'm a bit concerned that a too clever scoring
could be hard to explain to users, but I'll give it a shot.


On Tue, Feb 17, 2015 at 9:41 AM, Mark Harwood <
mark.harw...@elasticsearch.com> wrote:

> Nice to see someone taking the trouble to put their stats in context.
> Drives me nuts every time I see the equivalent of this:
> http://xkcd.com/1138/
>
> So we have a feature that does some of what you are after - it's called
> the "significant_terms" aggregation.
> Your query would look like this:
> {
>   "query": {
>     "match": {
>       "text": "foo"
>     }
>   },
>   "aggs": {
>     "keywords": {
>       "significant_terms": {
>         "field": "country",
>         "size": 100
>       }
>     }
>   }
> }
>
> What you get back are buckets for each country with a doc_count that
> represents how many "foo" documents there were in that country and a
> background count called "bg_count" which is how many docs (foo and non foo)
> came from that country. Selections are ranked using a score that is
> returned and which is more nuanced than a straight doc_count/bg_count
> percentage. In practice we find prioritizing selections solely by a
> percentage measure can skew results towards very rare terms (in your case,
> very small countries) that have few data samples and so can more easily achieve
> high-scoring percentages. Instead, we offer a variety of scoring heuristics
> which place a different emphasis on popular vs rare when it comes to
> ranking: (see https://twitter.com/elasticmark/status/513320986956292096 )
>
> Cheers
> Mark
>
> On Tuesday, February 17, 2015 at 1:07:31 AM UTC, ja...@holderdeord.no
> wrote:
>>
>> Hi,
>>
>> I'm looking for a way to have Elasticsearch calculate the percentage of
>> docs that match a query *within* a terms aggregation.
>> That is, given two aggregations where one is filtered and the other is
>> not:
>>
>> {
>>   "aggregations": {
>>     "countries": {
>>       "filter": {
>>         "query": {
>>           "query_string": {
>>             "default_field": "description",
>>             "query": "foo"
>>           }
>>         }
>>       },
>>       "aggregations": {
>>         "filteredCountries": {
>>           "terms": { "field": "country" }
>>         }
>>       }
>>     },
>>     "totalCountries": {
>>       "terms": { "field": "country" }
>>     }
>>   },
>>   "size": 0
>> }
>>
>> Let's say the totalCountries buckets are:
>>
>> "buckets": [
>>   {
>>     "key": "USA",
>>     "doc_count": 100
>>   },
>>   {
>>     "key": "UK",
>>     "doc_count": 50
>>   }
>> ]
>>
>>
>> and the filteredCountries buckets are:
>>
>> "buckets": [
>>   {
>>     "key": "USA",
>>     "doc_count": 10
>>   },
>>   {
>>     "key": "UK",
>>     "doc_count": 25
>>   }
>> ]
>>
>>
>> Is there a way to get a response that returns filteredCountries as
>> percentages of totalCountries? I.e. something like:
>>
>> [
>>   {
>>     "key": "USA",
>>     "percent": 10
>>   },
>>   {
>>     "key": "UK",
>>     "percent": 50
>>   }
>> ]
>>
>> Thanks!
>>



Combining two aggregations to get term percentage

2015-02-16 Thread jari
Hi,

I'm looking for a way to have Elasticsearch calculate the percentage of 
docs that match a query *within* a terms aggregation. 
That is, given two aggregations where one is filtered and the other is not:

{
  "aggregations": {
    "countries": {
      "filter": {
        "query": {
          "query_string": {
            "default_field": "description",
            "query": "foo"
          }
        }
      },
      "aggregations": {
        "filteredCountries": {
          "terms": { "field": "country" }
        }
      }
    },
    "totalCountries": {
      "terms": { "field": "country" }
    }
  },
  "size": 0
}

Let's say the totalCountries buckets are:

"buckets": [
  {
    "key": "USA",
    "doc_count": 100
  },
  {
    "key": "UK",
    "doc_count": 50
  }
]


and the filteredCountries buckets are:

"buckets": [
  {
    "key": "USA",
    "doc_count": 10
  },
  {
    "key": "UK",
    "doc_count": 25
  }
]


Is there a way to get a response that returns filteredCountries as
percentages of totalCountries? I.e. something like:

[
  {
    "key": "USA",
    "percent": 10
  },
  {
    "key": "UK",
    "percent": 50
  }
]

Thanks!
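
For comparison, here is a minimal sketch of doing this merge on the client once both aggregations have returned. The helper name bucket_percentages is hypothetical; the bucket lists are the example values from this message:

```python
def bucket_percentages(filtered_buckets, total_buckets):
    """Express each filtered bucket's doc_count as a percentage of the total bucket's."""
    totals = {b["key"]: b["doc_count"] for b in total_buckets}
    return [
        {"key": b["key"], "percent": 100 * b["doc_count"] / totals[b["key"]]}
        for b in filtered_buckets
        if totals.get(b["key"])  # skip keys missing from the totals (or zero)
    ]


total = [{"key": "USA", "doc_count": 100}, {"key": "UK", "doc_count": 50}]
filtered = [{"key": "USA", "doc_count": 10}, {"key": "UK", "doc_count": 25}]

print(bucket_percentages(filtered, total))
```

One caveat with this approach: a terms aggregation returns only its top buckets unless size is raised, so a key present in the filtered list may be absent from the totals. The sketch skips such keys rather than guessing.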



Recreating Google's Ngram Viewer with elasticsearch

2014-11-09 Thread jari
Hello,

I'm looking for tips on how to recreate something like Google's Ngram viewer 
<https://books.google.com/ngrams> with elasticsearch. I have a text corpus 
of < 500 MB for which this kind of tool would be very valuable.

I've had some success with the shingle token filter
<http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html>
and the date histogram aggregation
<http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html>,
but the results are not ideal: I'd like to get a histogram of word/phrase
frequencies, not a histogram of how many documents the word/phrase occurs
in.

It looks like what I need is some kind of combination of shingles, term
vectors
<http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-termvectors.html>
and the date histogram aggregation, but I'm not sure how to proceed. I can
improve my current approach by breaking the corpus into smaller pieces, i.e.
by making my documents paragraphs instead of chapters. But what I really want
is a "shingle frequency date histogram".

Is this something that can be accomplished with elasticsearch?

Jari
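
To make the distinction concrete, here is a minimal client-side sketch of counting phrase occurrences per year rather than documents per year. It is not an Elasticsearch query: the helper name phrase_counts_by_year, the "date" and "text" field names, and the sample documents are all assumptions, and tokenization is plain whitespace splitting:

```python
from collections import Counter


def phrase_counts_by_year(docs, phrase):
    """Count occurrences (not documents) of `phrase` per year."""
    needle = phrase.lower().split()
    counts = Counter()
    for doc in docs:
        words = doc["text"].lower().split()
        # Slide a window of the phrase length over the words and count matches.
        hits = sum(
            words[i:i + len(needle)] == needle
            for i in range(len(words) - len(needle) + 1)
        )
        counts[doc["date"][:4]] += hits
    return counts


# Hypothetical documents; the "date" and "text" field names are assumptions.
docs = [
    {"date": "2013-05-01", "text": "to be or not to be"},
    {"date": "2013-09-12", "text": "we may be late"},
    {"date": "2014-01-20", "text": "to be continued"},
]

print(sorted(phrase_counts_by_year(docs, "to be").items()))
```

A document-count histogram over the same data would report one document for 2013, whereas the occurrence count for "to be" in 2013 is two.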
