Rob, Riak Search doesn't keep a traditional term-frequency count. It has something similar, but it's an estimate and it's much more expensive than a simple table lookup. Even if it did have term frequencies, it doesn't really expose them to the outside world. On top of that, the standard analyzer provides no way to specify additional stopwords. You'd have to keep track of this data externally and do some pre-processing to strip stopwords from your queries beforehand.
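
To give a rough idea of the kind of client-side bookkeeping I mean, here's a minimal sketch in Python. Nothing here is provided by Riak Search itself: the counter, the 100,000 cutoff (mirroring the too_many_results limit you hit), and the function names are all just placeholders you'd adapt to your own indexing pipeline.

    # Rough sketch: track per-term document frequency outside of Riak and
    # strip "personal stopwords" from queries before sending them to search.
    from collections import Counter

    doc_freq = Counter()  # term -> number of documents containing it

    def index_document(tokens):
        # Call this with the analyzed tokens of each document you store;
        # set() ensures a term is counted once per document.
        doc_freq.update(set(tokens))

    def prune_query(terms, max_docs=100000):
        # Drop terms that would match too many documents (the same
        # 100,000 limit that triggers too_many_results).
        return [t for t in terms if doc_freq[t] < max_docs]

    # After indexing enough reviews, prune_query(["5-star", "hotel"])
    # would drop "hotel" and leave ["5-star"].
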
For the last 9 months I've been working on a project called Yokozuna, with the goal of replacing Riak Search [1]. It's like Riak Search, except much better, because the underlying engine is actual Solr/Lucene rather than an inferior clone written in Erlang. With Yokozuna you could add new stopwords, exploit query caching, and use newer features like LUCENE-4628 [2] to help combat high-frequency terms. You'd also have an easy way to get the frequency count for a given term, to determine whether you should make it a stopword (a rough sketch of that is below the quoted message).

[1] https://github.com/basho/yokozuna
[2] https://issues.apache.org/jira/browse/LUCENE-4628

On Fri, Mar 22, 2013 at 2:21 PM, Rob Speer <r...@luminoso.com> wrote:
> My company is starting to use Riak for document storage. I'm pretty happy
> about how it has been working so far, but I see the messages of foreboding
> and doom out there about Riak Search, and I've encountered a problem myself.
>
> I can't really avoid using Riak Search, as full-text indexing is a key
> feature we need to provide. If Riak Search is suboptimal, so is basically
> every other text index out there. We've just been burned by ElasticSearch's
> ineffective load balancing (who would have guessed, consistent hashing is
> kind of important).
>
> I know that performing searches in Riak Search that return many thousands
> of documents is discouraged for performance reasons, and the developers
> encourage removing stopwords to help with this. There is additionally, I
> have seen, a hard limit on the number of documents that can be examined by
> a search query; if any term matches more than 100,000 documents, the query
> will return a too_many_results error (and, incidentally, things get so
> confused that, in the Python client, the *next* query will also fail with
> an HTTP error 400).
>
> The question is, what should I actually do to avoid this case? I've
> already removed the usual stopwords, but any particular set of documents
> might have its own personal stopwords. For example, in a database of
> millions of hotel reviews, the word 'hotel' could easily appear in more
> than 100,000 documents.
>
> If we need to search for '5-star hotel', it's wasteful and probably
> crash-prone to retrieve all the 'hotel' results. What I'd really like to do
> is just search for '5-star', which because of IDF scoring will have about
> the same effect. That requires knowing somehow that the word 'hotel'
> appears in too many documents.
>
> Is there a way to determine, via Riak, which terms are overused so I can
> remove them from search queries? Or do I need to keep track of this
> entirely on the client end so I can avoid searching for those terms?
>
> Thanks,
> -- Rob Speer
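
To make that last point concrete, here's a minimal sketch of checking a term's document frequency once you're on Solr, by asking how many documents match it and treating the result as a stopword signal. The host, port, URL path, field name, and cutoff below are assumptions about a typical setup, not anything Yokozuna guarantees; adjust them to wherever your Solr endpoint actually lives.

    # Minimal sketch: ask Solr how many documents contain a term, then
    # decide whether to treat it as a stopword. The URL is an assumed
    # local Solr/Yokozuna endpoint; change it to match your deployment.
    import json
    import urllib
    import urllib2

    SOLR_URL = "http://localhost:8093/internal_solr/hotel_reviews/select"  # assumption

    def doc_frequency(field, term):
        params = urllib.urlencode({
            "q": '%s:"%s"' % (field, term),  # match the single term
            "rows": 0,                       # we only want the count
            "wt": "json",
        })
        resp = json.load(urllib2.urlopen(SOLR_URL + "?" + params))
        return resp["response"]["numFound"]

    # e.g. doc_frequency("text", "hotel") > 100000 suggests "hotel" is
    # a personal stopword for this corpus and worth pruning from queries.
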
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com