Rob, Riak Search doesn't keep a traditional term-frequency count. It has something similar, but it's an estimate and it's much more expensive than a simple table lookup. Even if it did have term frequencies, it doesn't really expose them to the outside world. On top of that, the standard analyzer provides no way to specify additional stopwords. You'd have to keep track of this data externally and do some pre-processing to strip stopwords from your queries beforehand.
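
To give a rough idea of the kind of client-side bookkeeping I mean, here's a minimal sketch in Python. Nothing here is provided by Riak Search itself: the counter, the 100,000 cutoff (mirroring the too_many_results limit you hit), and the function names are all just placeholders you'd adapt to your own indexing pipeline.

    # Rough sketch: track per-term document frequency outside of Riak and
    # strip "personal stopwords" from queries before sending them to search.
    from collections import Counter

    doc_freq = Counter()  # term -> number of documents containing it

    def index_document(tokens):
        # Call this with the analyzed tokens of each document you store;
        # set() ensures a term is counted once per document.
        doc_freq.update(set(tokens))

    def prune_query(terms, max_docs=100000):
        # Drop terms that would match too many documents (the same
        # 100,000 limit that triggers too_many_results).
        return [t for t in terms if doc_freq[t] < max_docs]

    # After indexing enough reviews, prune_query(["5-star", "hotel"])
    # would drop "hotel" and leave ["5-star"].
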
For the last 9 months I've been working on a project called Yokozuna, with the goal of replacing Riak Search [1]. It's like Riak Search, except much better, because the underlying engine is actual Solr/Lucene rather than an inferior clone written in Erlang. With Yokozuna you could add new stopwords, exploit query caching, and use newer features like LUCENE-4628 [2] to help combat high-frequency terms. You'd also have an easy way to get the frequency count for a given term, to determine whether you should make it a stopword (a rough sketch of that is below the quoted message).

[1] https://github.com/basho/yokozuna
[2] https://issues.apache.org/jira/browse/LUCENE-4628

On Fri, Mar 22, 2013 at 2:21 PM, Rob Speer <r...@luminoso.com> wrote:
> My company is starting to use Riak for document storage. I'm pretty happy
> about how it has been working so far, but I see the messages of foreboding
> and doom out there about Riak Search, and I've encountered a problem myself.
>
> I can't really avoid using Riak Search, as full-text indexing is a key
> feature we need to provide. If Riak Search is suboptimal, so is basically
> every other text index out there. We've just been burned by ElasticSearch's
> ineffective load balancing (who would have guessed, consistent hashing is
> kind of important).
>
> I know that performing searches in Riak Search that return many thousands
> of documents is discouraged for performance reasons, and the developers
> encourage removing stopwords to help with this. There is additionally, I
> have seen, a hard limit on the number of documents that can be examined by
> a search query; if any term matches more than 100,000 documents, the query
> will return a too_many_results error (and, incidentally, things get so
> confused that, in the Python client, the *next* query will also fail with
> an HTTP error 400).
>
> The question is, what should I actually do to avoid this case? I've
> already removed the usual stopwords, but any particular set of documents
> might have its own personal stopwords. For example, in a database of
> millions of hotel reviews, the word 'hotel' could easily appear in more
> than 100,000 documents.
>
> If we need to search for '5-star hotel', it's wasteful and probably
> crash-prone to retrieve all the 'hotel' results. What I'd really like to do
> is just search for '5-star', which because of IDF scoring will have about
> the same effect. That requires knowing somehow that the word 'hotel'
> appears in too many documents.
>
> Is there a way to determine, via Riak, which terms are overused so I can
> remove them from search queries? Or do I need to keep track of this
> entirely on the client end so I can avoid searching for those terms?
>
> Thanks,
> -- Rob Speer
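
To make that last point concrete, here's a minimal sketch of checking a term's document frequency once you're on Solr, by asking how many documents match it and treating the result as a stopword signal. The host, port, URL path, field name, and cutoff below are assumptions about a typical setup, not anything Yokozuna guarantees; adjust them to wherever your Solr endpoint actually lives.

    # Minimal sketch: ask Solr how many documents contain a term, then
    # decide whether to treat it as a stopword. The URL is an assumed
    # local Solr/Yokozuna endpoint; change it to match your deployment.
    import json
    import urllib
    import urllib2

    SOLR_URL = "http://localhost:8093/internal_solr/hotel_reviews/select"  # assumption

    def doc_frequency(field, term):
        params = urllib.urlencode({
            "q": '%s:"%s"' % (field, term),  # match the single term
            "rows": 0,                       # we only want the count
            "wt": "json",
        })
        resp = json.load(urllib2.urlopen(SOLR_URL + "?" + params))
        return resp["response"]["numFound"]

    # e.g. doc_frequency("text", "hotel") > 100000 suggests "hotel" is
    # a personal stopword for this corpus and worth pruning from queries.
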
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com