Recreating Google's Ngram Viewer with elasticsearch

jari Sun, 09 Nov 2014 11:17:31 -0800

Hello,

I'm looking for tips on how to recreate something like Google's Ngram Viewer 
<https://books.google.com/ngrams> with elasticsearch. I have a text corpus 
of under 500 MB for which this kind of tool would be very valuable.


I've had some success with the shingle token filter 
<http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html>
and the date histogram aggregation 
<http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html>,
but the results are not ideal: I'd like a histogram of how often a 
word/phrase occurs, not a histogram of how many documents the word/phrase 
occurs in.
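
For reference, the approach described above might look roughly like the sketch below, written as Python dicts that would be sent as JSON request bodies. The filter and analyzer names, the field names `text` and `date`, and the helper `phrase_histogram_query` are all assumptions made for illustration. The field mapping that assigns the analyzer to `text` is left out because its syntax differs between elasticsearch versions.

```python
# Index settings: a custom analyzer that emits single words plus
# 2- and 3-word shingles. All names here are hypothetical.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "shingle_2_3": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3,
                    "output_unigrams": True,
                }
            },
            "analyzer": {
                "shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "shingle_2_3"],
                }
            },
        }
    }
}


def phrase_histogram_query(phrase, text_field="text", date_field="date",
                           interval="year"):
    """Build a search body that buckets matching documents by date.

    Assumes `text_field` is analyzed with the shingle analyzer above, so the
    whole lowercased phrase exists as a single token in the index.
    """
    return {
        "size": 0,  # only the aggregation is wanted, not the hits
        "query": {"term": {text_field: phrase.lower()}},
        "aggs": {
            "per_period": {
                # Newer releases split this parameter into
                # "calendar_interval" and "fixed_interval".
                "date_histogram": {"field": date_field, "interval": interval}
            }
        },
    }


body = phrase_histogram_query("New York")
```

Each bucket returned by this aggregation carries a `doc_count`, which is the number of matching documents in that period. A chapter that uses the phrase fifty times counts once, and that is exactly the limitation described above.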

It looks like what I need is some kind of combination of shingles, term 
vectors 
<http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-termvectors.html>
and the date histogram aggregation, but I'm not sure how to proceed. I can 
improve my current approach by breaking the corpus into smaller pieces, 
i.e. indexing paragraphs instead of chapters as documents, so that document 
counts come closer to occurrence counts. But what I really want is a 
"shingle frequency date histogram". 

Is this something that can be accomplished with elasticsearch?
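
To pin down what a "shingle frequency date histogram" would mean, here is a minimal sketch in plain Python. It does not use elasticsearch at all, and it is not a proposed solution. It only contrasts the two numbers: occurrences of a phrase per year versus documents containing the phrase per year. The function names, the `(year, text)` input shape and the naive whitespace tokenization are assumptions for illustration.

```python
from collections import Counter


def shingles(tokens, size):
    """Return every run of `size` consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def frequency_histogram(docs, phrase):
    """Map each year to (phrase occurrences, documents containing the phrase).

    `docs` is an iterable of (year, text) pairs. Tokenization is a plain
    lowercase whitespace split, far cruder than a real analyzer.
    """
    target = phrase.lower().split()
    occurrences = Counter()
    doc_counts = Counter()
    for year, text in docs:
        tokens = text.lower().split()
        n = shingles(tokens, len(target)).count(" ".join(target))
        occurrences[year] += n
        if n:
            doc_counts[year] += 1
    return {year: (occurrences[year], doc_counts[year])
            for year in sorted(occurrences)}


docs = [
    (1900, "the old man and the old sea"),
    (1900, "a new day"),
    (1901, "the old old man"),
]
histogram = frequency_histogram(docs, "the old")
# histogram == {1900: (2, 1), 1901: (1, 1)}
```

For 1900 the phrase occurs twice but in only one document. The first number is the one an Ngram-style chart needs, while the second is what a document-count histogram reports.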

Jari

