Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5

aaron morton Thu, 05 May 2011 03:48:55 -0700

I take it back, the problem started in 0.6, where keys were strings. Looking 
into how 0.6 did its thing.



-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 5 May 2011, at 22:36, aaron morton wrote:

> Interesting, but as we are dealing with keys it should not matter, as they are 
> treated as byte buffers. 
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 5 May 2011, at 04:53, Daniel Doubleday wrote:
> 
>> This is a bit of a wild guess but Windows and encoding and 0.7.5 sounds like
>> 
>> https://issues.apache.org/jira/browse/CASSANDRA-2367
>> 
>>  
>> On May 3, 2011, at 5:15 PM, Henrik Schröder wrote:
>> 
>>> Hey everyone,
>>> 
>>> We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7, 
>>> just to make sure that the change in how keys are encoded wouldn't cause us 
>>> any data loss. Unfortunately, it seems that rows stored under a Unicode key 
>>> couldn't be retrieved after the upgrade. We're running everything on 
>>> Windows, and we're using the generated thrift client in C# to access it.
>>> 
>>> I managed to make a minimal test to reproduce the error consistently:
>>> 
>>> First, I started up Cassandra 0.6.13 with an empty data directory, and a 
>>> really simple config with a single keyspace with a single BytesType 
>>> column family.
>>> I wrote two rows, each with a single column with a simple column name and a 
>>> 1-byte value of "1". The first row had a key using only ASCII chars 
>>> ('foo'), and the second row had a key using Unicode chars ('ドメインウ').
>>> 
>>> Using multi_get, and both those keys, I got both columns back, as expected.
>>> Using multi_get_slice and both those keys, I got both columns back, as 
>>> expected.
>>> I also did a get_range_slices to get all rows in the columnfamily, and I 
>>> got both columns back, as expected.
>>> 
>>> So far so good. Then I drain and shut down Cassandra 0.6.13, and start up 
>>> Cassandra 0.7.5, pointing to the same data directory, with a config 
>>> containing the same keyspace, and I run the schematool import command.
>>> 
>>> I then start up my test program that uses the new thrift api, and run some 
>>> commands.
>>> 
>>> Using multi_get_slice, and those two keys encoded as UTF8 byte-arrays, I 
>>> only get back one column, the one under the key 'foo'. The other row I 
>>> simply can't retrieve.
>>> 
>>> However, when I use get_range_slices to get all rows, I get back two rows, 
>>> with the correct column values, and the byte-array keys are identical to my 
>>> encoded keys, and when I decode the byte-arrays as UTF8 strings, I get back 
>>> my two original keys. This means that both my rows are still there, the 
>>> keys as output by Cassandra are identical to the original string keys I 
>>> used when I created the rows in 0.6.13, but it's just impossible to 
>>> retrieve the second row.
>>> 
>>> To continue the test, I inserted a row with the key 'ドメインウ' encoded as 
>>> UTF-8 again, and gave it a similar column as the original, but with a 
>>> 1-byte value of "2".
>>> 
>>> Now, when I use multi_get_slice with my two encoded keys, I get back two 
>>> rows, the 'foo' row has the old value as expected, and the other row has 
>>> the new value as expected.
>>> 
>>> However, when I use get_range_slices to get all rows, I get back *three* 
>>> rows, two of which have the *exact same* byte-array key, one has the old 
>>> column, one has the new column. 
>>> 
>>> 
>>> How is this possible? How can there be two different rows with the exact 
>>> same key? I'm guessing that it's related to the encoding of string keys in 
>>> 0.6, and that the internal representation is off somehow. I checked the 
>>> generated thrift client for 0.6, and it UTF8-encodes all keys before 
>>> sending them to the server, so it should be UTF8 all the way, but 
>>> apparently it isn't.
>>> 
>>> Has anyone else experienced the same problem? Is it a platform-specific 
>>> problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 and not 
>>> lose any rows? I would also really like to know which byte-array I should 
>>> send in to get back that second row, there's gotta be some key that can be 
>>> used to get it, the row is still there after all.
>>> 
>>> 
>>> /Henrik Schröder
>> 
>
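
The original message asks which byte array a string key maps to. As a rough illustration only, the sketch below prints the bytes that the two test keys produce under a few character encodings. It does not establish the cause of the problem in this thread: the helper name key_bytes_report and the choice of charsets (utf-8, utf-16-le, shift_jis) are assumptions made for the example, not anything taken from Cassandra or the Thrift client.

```python
# Hypothetical helper: shows how one string key turns into different
# byte arrays depending on the charset used to encode it.
def key_bytes_report(key, charsets=("utf-8", "utf-16-le", "shift_jis")):
    """Return a dict mapping charset name to the hex of the encoded key."""
    report = {}
    for charset in charsets:
        try:
            report[charset] = key.encode(charset).hex()
        except UnicodeEncodeError:
            # The charset cannot represent this key at all.
            report[charset] = None
    return report


ascii_report = key_bytes_report("foo")
kana_report = key_bytes_report("ドメインウ")

for name, report in (("foo", ascii_report), ("ドメインウ", kana_report)):
    print(name)
    for charset, hex_bytes in report.items():
        print("  %-10s %s" % (charset, hex_bytes))
```

The ASCII key yields the same bytes under utf-8 and shift_jis, while the katakana key yields a different byte array under every charset listed. A mismatch of this kind between writer and reader would affect only non-ASCII keys, so dumping keys as hex on both sides of an upgrade is one way to check whether the stored bytes really match the bytes being requested.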
