Alexander,

Search stores all fields/values as UTF-8. As long as your data reaches Search encoded as UTF-8, things should work.
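A quick way to check that end to end is a small round-trip script from Node, since that is already in the stack. This is a minimal sketch, not a confirmed recipe: the host/port, the search-enabled "notes" bucket, and the Solr-compatible /solr/<bucket>/select query endpoint are all assumptions to adjust for your setup.

    // Minimal UTF-8 round-trip check (sketch; adjust host, port, bucket).
    var http = require('http');

    var doc = JSON.stringify({ value: 'héllo wörld 你好' });

    // 1. Store a document containing non-ASCII text; the search hook
    //    indexes it on write. Declare UTF-8 explicitly in Content-Type.
    var put = http.request({
      method: 'PUT',
      host: '127.0.0.1',
      port: 8098,
      path: '/riak/notes/utf8-test',
      headers: {
        'Content-Type': 'application/json; charset=utf-8',
        'Content-Length': Buffer.byteLength(doc, 'utf8')
      }
    }, function () {
      // 2. Query the same term back through the Solr-style endpoint and
      //    confirm the non-ASCII text survived the whole round trip.
      http.get({
        host: '127.0.0.1',
        port: 8098,
        path: '/solr/notes/select?wt=json&q=' +
              encodeURIComponent('value:héllo')
      }, function (res) {
        var body = '';
        res.setEncoding('utf8');
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () { console.log(body); });
      });
    });
    put.end(doc, 'utf8');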
However, if your data contains different languages, that is only part of the problem. The other part is that analyzers need to be aware of language. For example, the definition of a "word" in English is different from Chinese. All the analyzers in Search analyze based on ASCII, e.g. a word boundary is a space (0x20); the snippet at the end of this message illustrates what that means in practice. Now, the Search analyzers may not be aware of other languages, but they treat both indexes and queries the same way, so even if it's wrong it's consistently wrong (I hope that makes sense).

We have some tests that check different character sets on top of UTF-8, but to be extra sure you should run some tests yourself to verify the entire stack plays well together.

-Ryan

On Thu, Feb 2, 2012 at 8:09 AM, Alexander Sicular <[email protected]> wrote:

> Hello All,
>
> Are there limitations as to character sets or other special characters
> (perhaps a guide or docs) that I should be actively filtering out of user
> generated searches that will be passed to Riak Search?
>
> Yes, standard filtering rules apply, but are there any specific "gotchas"
> that people have come across when implementing Riak Search?
>
> A rough sketch of my flow:
> Browser > nodejs > riak search > mapreduce
>
> Tia,
> Alexander
>
> @siculars on twitter
> http://siculars.posterous.com
>
> Sent from my iRotaryPhone
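To illustrate the whitespace-only word boundary described above (the strings here are made up, purely for illustration): splitting on whitespace separates English text into terms, but leaves an unsegmented Chinese sentence as one opaque token.

    'riak search rocks'.split(/\s+/);  // => [ 'riak', 'search', 'rocks' ]
    '我喜欢全文搜索'.split(/\s+/);       // => [ '我喜欢全文搜索' ] (one token)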
