Yet another approach is to transform your keys into byte-comparable values
that preserve your desired sort order, and store those instead. The ICU
library can do this for various collations of Unicode strings:

http://userguide.icu-project.org/collation/architecture#TOC-Sort-Keys

So for this case HBase could store the ICU sort key rather than the actual
UTF string. You then get correct scans, but just as in Ian's example, you
need to implement a layer that converts the UTF strings in your client's
requests to HBase into sort keys. This will almost certainly give you
better HBase performance, since memcmp is generally faster than a custom
comparator.
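
To make the idea concrete, here is a rough sketch. It uses the JDK's own
java.text.Collator as a stand-in for ICU (ICU4J has similarly named classes,
but I have not run this against it), and the class and helper names
(SortKeyDemo, toRowKey, compareBytes) are made up for illustration. It
assumes Java 9+ for Arrays.compareUnsigned, which mimics the unsigned,
memcmp-style byte ordering HBase applies to row keys.

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class SortKeyDemo {
    // Stand-in for an ICU collator: the JDK collator for French, default strength.
    static final Collator COLLATOR = Collator.getInstance(Locale.FRENCH);

    // Hypothetical helper: the bytes you would store as the row key
    // instead of the raw UTF-8 string.
    static byte[] toRowKey(String s) {
        return COLLATOR.getCollationKey(s).toByteArray();
    }

    // Unsigned lexicographic comparison, i.e. a memcmp-style ordering.
    static int compareBytes(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String e = "e";
        String eGrave = "\u00e8"; // "è"
        String f = "f";

        // Raw UTF-8 bytes: "è" is a two-byte sequence that compares above "f".
        System.out.println("raw  e-grave < f: "
                + (compareBytes(utf8(eGrave), utf8(f)) < 0));

        // Sort keys: byte order follows the collation order.
        System.out.println("key  e < e-grave: "
                + (compareBytes(toRowKey(e), toRowKey(eGrave)) < 0));
        System.out.println("key  e-grave < f: "
                + (compareBytes(toRowKey(eGrave), toRowKey(f)) < 0));
    }
}
```

Note that a sort key cannot be turned back into the original string, so you
would still keep the UTF string somewhere (e.g. in the cell value). The
client-side layer would also have to run scan start/stop rows through the
same toRowKey-style conversion, with the same collator settings, before
handing them to HBase.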

On Fri, Jun 8, 2012 at 10:40 AM, Ian Varley <ivar...@salesforce.com> wrote:

> Tom, another approach you could take would be to store an ASCII encoded
> version of the string as the row key or column qualifier, and then the full
> UTF-8 string elsewhere (e.g. in the cell value, or even later in the row
> key). That wouldn't work out the fine sorting (whether "è" sorts before or
> after "e") but it would solve the gross sorting ("è" would always come
> before "f"). If you need true UTF-8 collation in the results, you could
> then implement it as a layer on top of that (in your app, or maybe a
> co-processor, I'm not sure about the latter). But at least with this
> approach, you'd be able to take advantage of rowkey ranges in your scans,
> which would probably make up for any time spent doing a secondary sort.
>
> Ian
>
> On Jun 8, 2012, at 12:34 PM, Tom Brown wrote:
>
> > Storing the bytes as native UTF-16 or UTF-32 will not help.  Even
> > strings in UTF-8 format can be sorted by their code points when stored
> > as bytes.  Unfortunately, that's not really useful for collation as
> > characters like "è" (U+00E8) should appear between "e" (U+0065) and
> > "f" (U+0066), but the code points do not allow this.
> >
> > Thanks anyway!
> >
> > --Tom
> >
> > On Fri, Jun 8, 2012 at 11:14 AM, Stack <st...@duboce.net> wrote:
> >> On Fri, Jun 8, 2012 at 9:35 AM, Tom Brown <tombrow...@gmail.com> wrote:
> >>> Is there any way to introduce a different ordering scheme from
> >>> the base comparable bytes?  My use case is that I am using UTF-8 data
> >>> for my keys, and I would like to have scans use UTF-8 collation.
> >>>
> >>> Could this be done by providing an alternate implementation of
> >>> WritableComparable<ImmutableBytesWritable>?
> >>>
> >>> Thanks in advance!
> >>>
> >>
> >> Unfortunately no Tom.  The database is all sorted the same way.
> >> Different sorts per table would complicate system interactions (the
> >> catalog tables would have to change sort by table).  It might be
> >> doable but it would take some work.
> >>
> >> Can you store your data UTF-16 or UTF-32?  It's a while since I dealt
> >> w/ this stuff but IIRC, their sort order is byte order?  (WARNING!  I
> >> could be way off here).
> >>
> >> St.Ack
>
>
