Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5

Henrik Schröder Tue, 03 May 2011 08:16:02 -0700

Hey everyone,

We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7,
just to make sure that the change in how keys are encoded wouldn't cause us
any data loss. Unfortunately it seems that rows stored under a unicode key
couldn't be retrieved after the upgrade. We're running everything on
Windows, and we're using the generated thrift client in C# to access it.


I managed to make a minimal test to reproduce the error consistently:

First, I started up Cassandra 0.6.13 with an empty data directory, and a
really simple config with a single keyspace with a single BytesType
columnfamily.
I wrote two rows, each with a single column with a simple column name and a
1-byte value of "1". The first row had a key using only ascii chars ('foo'),
and the second row had a key using unicode chars ('ドメインウ').
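
For reference, the byte arrays that the two keys turn into when encoded as UTF-8 can be sketched as follows. This is a minimal Python illustration, not the C# test program; the helper name encode_key is made up for the example.

```python
# The two row keys used in the test.
ascii_key = "foo"
unicode_key = "ドメインウ"


def encode_key(key: str) -> bytes:
    """Hypothetical helper: encode a string key as a UTF-8 byte array."""
    return key.encode("utf-8")


ascii_bytes = encode_key(ascii_key)
unicode_bytes = encode_key(unicode_key)

# ASCII characters take one byte each, the katakana characters three each.
print(len(ascii_bytes), ascii_bytes.hex())      # 3 666f6f
print(len(unicode_bytes), unicode_bytes.hex())  # 15 e38389e383a1e382a4e383b3e382a6

# Decoding the bytes as UTF-8 gives back the original string keys.
print(unicode_bytes.decode("utf-8") == unicode_key)  # True
```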

Using multiget and both those keys, I got both columns back, as expected.
Using multiget_slice and both those keys, I got both columns back, as
expected.
I also did a get_range_slices to get all rows in the columnfamily, and I got
both columns back, as expected.

So far so good. Then I drain and shut down Cassandra 0.6.13, and start up
Cassandra 0.7.5, pointing to the same data directory, with a config
containing the same keyspace, and I run the schematool import command.

I then start up my test program that uses the new thrift api, and run some
commands.

Using multiget_slice and those two keys encoded as UTF-8 byte arrays, I
only get back one column, the one under the key 'foo'. The other row I
simply can't retrieve.

However, when I use get_range_slices to get all rows, I get back two rows,
with the correct column values, and the byte-array keys are identical to my
encoded keys, and when I decode the byte-arrays as UTF8 strings, I get back
my two original keys. This means that both my rows are still there, the keys
as output by Cassandra are identical to the original string keys I used when
I created the rows in 0.6.13, but it's just impossible to retrieve the
second row.

To continue the test, I inserted a row with the key 'ドメインウ' encoded as UTF-8
again, and gave it a similar column as the original, but with a 1-byte value
of "2".

Now, when I use multiget_slice with my two encoded keys, I get back two
rows, the 'foo' row has the old value as expected, and the other row has the
new value as expected.

However, when I use get_range_slices to get all rows, I get back *three*
rows, two of which have the *exact same* byte-array key, one has the old
column, one has the new column.
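
A check for this duplicate-key situation could look roughly like the sketch below. It is a Python illustration under stated assumptions: the range scan result is modelled as a plain list of (key bytes, columns) pairs rather than the real thrift types, and the column name "col" and the function find_duplicate_keys are made up for the example.

```python
from collections import defaultdict


def find_duplicate_keys(rows):
    """Group rows by raw key bytes and return the keys seen more than once."""
    by_key = defaultdict(list)
    for key, columns in rows:
        by_key[bytes(key)].append(columns)
    return {key: versions for key, versions in by_key.items() if len(versions) > 1}


# Hypothetical stand-in for the three rows returned by the range scan.
unicode_key = "ドメインウ".encode("utf-8")
rows = [
    (b"foo", {b"col": b"1"}),
    (unicode_key, {b"col": b"1"}),  # row written under 0.6.13
    (unicode_key, {b"col": b"2"}),  # row written under 0.7.5
]

duplicates = find_duplicate_keys(rows)
for key, versions in duplicates.items():
    print(key.decode("utf-8"), len(versions), "rows with byte-identical keys")
```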


How is this possible? How can there be two different rows with the exact
same key? I'm guessing that it's related to the encoding of string keys in
0.6, and that the internal representation is off somehow. I checked the
generated thrift client for 0.6, and it UTF8-encodes all keys before sending
them to the server, so it should be UTF8 all the way, but apparently it
isn't.
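
To illustrate the kind of mismatch guessed at here, the Python sketch below shows how one string key produces different bytes, and therefore a different hash, depending on the charset used to encode it. It is only an illustration of the guess, not a confirmed diagnosis: the choice of Windows-1252 as the non-UTF-8 charset and the use of MD5 as the hash are assumptions made for the example.

```python
import hashlib

ascii_key = "foo"
unicode_key = "ドメインウ"


def encode_both(key: str):
    """Return the key encoded as UTF-8 and as Windows-1252 (assumed charset)."""
    utf8_bytes = key.encode("utf-8")
    # Characters that Windows-1252 cannot represent are replaced with "?".
    legacy_bytes = key.encode("cp1252", errors="replace")
    return utf8_bytes, legacy_bytes


ascii_utf8, ascii_legacy = encode_both(ascii_key)
unicode_utf8, unicode_legacy = encode_both(unicode_key)

# A pure-ASCII key is byte-identical under both charsets.
print(ascii_utf8 == ascii_legacy)  # True

# The katakana key is not, so any hash computed from the bytes differs too.
print(unicode_utf8 == unicode_legacy)  # False
print(hashlib.md5(unicode_utf8).hexdigest() == hashlib.md5(unicode_legacy).hexdigest())  # False
```

If something like this happened anywhere between the 0.6 client and the data on disk, it would fit the observation that only the non-ASCII key is affected.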

Has anyone else experienced the same problem? Is it a platform-specific
problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 and not
lose any rows? I would also really like to know which byte-array I should
send in to get back that second row, there's gotta be some key that can be
used to get it, the row is still there after all.


/Henrik Schröder
