Thanks, but patching or losing keys is not an option for us. :-/
/Henrik

On Thu, May 5, 2011 at 15:00, Daniel Doubleday <daniel.double...@gmx.net> wrote:

> Don't know if that helps you but since we had the same SSTable corruption I
> have been looking into that very code the other day:
>
> If you could afford to drop these rows and are able to recognize them the
> easiest way would be patching:
>
> SSTableScanner:162
>
>     public IColumnIterator next()
>     {
>         try
>         {
>             if (row != null)
>                 file.seek(finishedAt);
>             assert !file.isEOF();
>
>             DecoratedKey key = SSTableReader.decodeKey(sstable.partitioner,
>                                                        sstable.descriptor,
>                                                        ByteBufferUtil.readWithShortLength(file));
>             long dataSize = SSTableReader.readRowSize(file, sstable.descriptor);
>             long dataStart = file.getFilePointer();
>             finishedAt = dataStart + dataSize;
>
>             if (filter == null)
>             {
>                 row = new SSTableIdentityIterator(sstable, file, key, dataStart, dataSize);
>                 return row;
>             }
>             else
>             {
>                 return row = filter.getSSTableColumnIterator(sstable, file, key);
>             }
>         }
>         catch (IOException e)
>         {
>             throw new RuntimeException(SSTableScanner.this + " failed to provide next columns from " + this, e);
>         }
>     }
>
> The string key is new String(ByteBufferUtil.getArray(key.key), "UTF-8")
> If you find one that you don't like just skip it.
>
> This way compaction goes through but obviously you'll lose data.
>
> On May 5, 2011, at 1:12 PM, Henrik Schröder wrote:
>
> Yeah, I've seen that one, and I'm guessing that it's the root cause of my
> problems, something something encoding error, but that doesn't really help
> me. :-)
>
> However, I've done all my tests with 0.7.5, I'm gonna try them again with
> 0.7.4, just to see how that version reacts.
>
> /Henrik
>
> On Wed, May 4, 2011 at 18:53, Daniel Doubleday <daniel.double...@gmx.net> wrote:
>
>> This is a bit of a wild guess but Windows and encoding and 0.7.5 sounds
>> like
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-2367
>>
>> On May 3, 2011, at 5:15 PM, Henrik Schröder wrote:
>>
>> Hey everyone,
>>
>> We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7,
>> just to make sure that the change in how keys are encoded wouldn't cause us
>> any data loss. Unfortunately it seems that rows stored under a unicode key
>> couldn't be retrieved after the upgrade. We're running everything on
>> Windows, and we're using the generated thrift client in C# to access it.
>>
>> I managed to make a minimal test to reproduce the error consistently:
>>
>> First, I started up Cassandra 0.6.13 with an empty data directory, and a
>> really simple config with a single keyspace with a single bytestype
>> columnfamily.
>> I wrote two rows, each with a single column with a simple column name and
>> a 1-byte value of "1". The first row had a key using only ascii chars
>> ('foo'), and the second row had a key using unicode chars ('ドメインウ').
>>
>> Using multi_get, and both those keys, I got both columns back, as
>> expected.
>> Using multi_get_slice and both those keys, I got both columns back, as
>> expected.
>> I also did a get_range_slices to get all rows in the columnfamily, and I
>> got both columns back, as expected.
>>
>> So far so good. Then I drain and shut down Cassandra 0.6.13, and start up
>> Cassandra 0.7.5, pointing to the same data directory, with a config
>> containing the same keyspace, and I run the schematool import command.
>>
>> I then start up my test program that uses the new thrift api, and run some
>> commands.
>>
>> Using multi_get_slice, and those two keys encoded as UTF8 byte-arrays, I
>> only get back one column, the one under the key 'foo'.
>> The other row I simply can't retrieve.
>>
>> However, when I use get_range_slices to get all rows, I get back two rows,
>> with the correct column values, and the byte-array keys are identical to my
>> encoded keys, and when I decode the byte-arrays as UTF8 strings, I get back
>> my two original keys. This means that both my rows are still there, the keys
>> as output by Cassandra are identical to the original string keys I used when
>> I created the rows in 0.6.13, but it's just impossible to retrieve the
>> second row.
>>
>> To continue the test, I inserted a row with the key 'ドメインウ' encoded as
>> UTF-8 again, and gave it a similar column as the original, but with a 1-byte
>> value of "2".
>>
>> Now, when I use multi_get_slice with my two encoded keys, I get back two
>> rows, the 'foo' row has the old value as expected, and the other row has the
>> new value as expected.
>>
>> However, when I use get_range_slices to get all rows, I get back *three*
>> rows, two of which have the *exact same* byte-array key, one has the old
>> column, one has the new column.
>>
>>
>> How is this possible? How can there be two different rows with the exact
>> same key? I'm guessing that it's related to the encoding of string keys in
>> 0.6, and that the internal representation is off somehow. I checked the
>> generated thrift client for 0.6, and it UTF8-encodes all keys before sending
>> them to the server, so it should be UTF8 all the way, but apparently it
>> isn't.
>>
>> Has anyone else experienced the same problem? Is it a platform-specific
>> problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 and not
>> lose any rows? I would also really like to know which byte-array I should
>> send in to get back that second row, there's gotta be some key that can be
>> used to get it, the row is still there after all.
>>
>>
>> /Henrik Schröder
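
For anyone who does want to try the skip-the-row approach Daniel describes above, the key check could look roughly like the sketch below. It is standalone Java that uses only java.nio, not any Cassandra classes: KeyInspector, decodeUtf8, toHex and shouldSkip are made-up names for illustration, and the ByteBuffer argument stands in for key.key from the SSTableScanner snippet. The hex dump is only there so two keys that look identical as strings can be compared byte for byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Set;

// Hypothetical helper, not part of Cassandra.
final class KeyInspector {

    // Strictly decodes the key bytes as UTF-8.
    // Returns null when the bytes are not valid UTF-8 instead of
    // silently substituting replacement characters.
    static String decodeUtf8(ByteBuffer key) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(key.duplicate()) // duplicate() leaves the caller's position untouched
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Lower-case hex dump of the remaining bytes of the key.
    static String toHex(ByteBuffer key) {
        ByteBuffer copy = key.duplicate();
        StringBuilder sb = new StringBuilder();
        while (copy.hasRemaining()) {
            sb.append(String.format("%02x", copy.get() & 0xff));
        }
        return sb.toString();
    }

    // Assumption: a row is skipped only when its key decodes cleanly and
    // is on the caller's list of unwanted keys. Undecodable keys are kept.
    static boolean shouldSkip(ByteBuffer key, Set<String> unwanted) {
        String decoded = decodeUtf8(key);
        return decoded != null && unwanted.contains(decoded);
    }
}
```

Inside next() the idea would be to call something like shouldSkip(key.key, unwanted) right after the key is decoded and move on to the next row when it returns true. As Daniel says, any row skipped this way is dropped by the compaction.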