Re: Serious issue updating Cassandra version and topology

2012-07-10 Thread Michael Theroux
Hello Aaron,

Thank you for responding.  Since the time of my original email, we noticed that 
data was lost in the process of performing this upgrade.  We have restored 
from backup and are now trying this again with two changes:

1) We will be using 1.1.2 throughout the cluster
2) We have switched back to Tiered compaction

In the process I've hit another very interesting issue that I will write a 
separate email about.

However, to answer your questions, this happened on the 1.1.2 node and it 
happened again after we ran the scrub.  The data has been around for a 
while.  We upgraded from 1.0.7 -> 1.1.2.

Unfortunately, I can't check the sstables as we've restarted the migration from 
the beginning.  If it happens again, I'll respond with more information.  

Thanks again,
-Mike

On Jul 10, 2012, at 5:05 AM, aaron morton wrote:

> To be clear, this happened on a 1.1.2 node and it happened again *after* you 
> had run a scrub ? 
> 
> Has this cluster been around for a while or was the data created with 1.1 ?
> 
> Can you confirm that all sstables were re-written for the CF? Check the 
> timestamp on the files. Also, all files should have the same version, the 
> -h?- part of the name.
> 
> Can you repair the other CFs?
> 
> If this cannot be repaired by scrub or upgradesstables you may need to cut the 
> row out of the sstables using sstable2json and json2sstable.
> 
> 
> Cheers
> 
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 8/07/2012, at 4:05 PM, Michael Theroux wrote:
> 
>> Hello,
>> 
>> We're in the process of trying to move a 6-node cluster from RF=1 to RF=3. 
>> Once our replication factor was upped to 3, we ran nodetool repair, and 
>> immediately hit an issue on the first node we ran repair on:
>> 
>> INFO 03:08:51,536 Starting repair command #1, repairing 2 ranges.
>> INFO 03:08:51,552 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] new session: will sync xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101, /10.29.187.61 on range (Token(bytes[d558]),Token(bytes[])] for x.[a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s]
>> INFO 03:08:51,555 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] requesting merkle trees for a (to [/10.29.187.61, xxx-xx-xx-xxx-compute-1.amazonaws.com/10.202.99.101])
>> INFO 03:08:52,719 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Received merkle tree for a from /10.29.187.61
>> INFO 03:08:53,518 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Received merkle tree for a from xxx-xx-xx-xxx-.compute-1.amazonaws.com/10.202.99.101
>> INFO 03:08:53,519 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] requesting merkle trees for b (to [/10.29.187.61, xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101])
>> INFO 03:08:53,639 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Endpoints /10.29.187.61 and xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101 are consistent for a
>> INFO 03:08:53,640 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] a is fully synced (18 remaining column family to sync for this session)
>> INFO 03:08:54,049 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Received merkle tree for b from /10.29.187.61
>> ERROR 03:09:09,440 Exception in thread Thread[ValidationExecutor:1,1,main]
>> java.lang.AssertionError: row DecoratedKey(Token(bytes[efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47]), efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47) received out of order wrt DecoratedKey(Token(bytes[f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb]), f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb)
>>     at org.apache.cassandra.service.AntiEntropyService$Validator.add(AntiEntropyService.java:349)
>>     at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:712)
>>     at org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:68)
>>     at org.apache.cassandra.db.compaction.CompactionManager$8.call(CompactionManager.java:438)
>>     at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>>     at java.util.concurrent.FutureTask.run(Unknown Source)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>     at java.lang.Thread.run(Unknown Source)
>> 
>> From the log above, it looks like the sync of the "a" column family was 
>> successful.  However, the "b" column family resulted in this error.  In 
>> addition, the repair hung after this error.  We ran nodetool scrub on all 
>> nodes and invalidated the key and row caches and tried again (with RF=2)

Re: Serious issue updating Cassandra version and topology

2012-07-10 Thread aaron morton
To be clear, this happened on a 1.1.2 node and it happened again *after* you 
had run a scrub ? 

Has this cluster been around for a while or was the data created with 1.1 ?

Can you confirm that all sstables were re-written for the CF? Check the 
timestamp on the files. Also, all files should have the same version, the -h?- 
part of the name.
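
The version check suggested above can be sketched in Python. This is a minimal sketch, not a Cassandra tool: it assumes data files are named like [<keyspace>-]<cf>-<version>-<generation>-Data.db, and the function names group_by_version and has_mixed_versions are hypothetical. It groups Data.db files by their version component so that a CF with files written in more than one format stands out.

```python
import re
from collections import defaultdict

# Assumed file-name layout (not confirmed in this thread):
#   [<keyspace>-]<cf>-<version>-<generation>-Data.db
# where <version> is a short lowercase tag and <generation> is a number.
_DATA_FILE = re.compile(
    r"^(?P<prefix>.+)-(?P<version>[a-z]+)-(?P<generation>\d+)-Data\.db$"
)


def group_by_version(filenames):
    """Map each sstable version tag to the Data.db files that carry it."""
    groups = defaultdict(list)
    for name in filenames:
        match = _DATA_FILE.match(name)
        if match:  # skip Index.db, Filter.db and anything else
            groups[match.group("version")].append(name)
    return dict(groups)


def has_mixed_versions(filenames):
    """True if the Data.db files were written in more than one format."""
    return len(group_by_version(filenames)) > 1
```

Feeding it the output of os.listdir() for the CF's data directory gives the grouping; os.path.getmtime() on each file covers the timestamp part of the check.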

Can you repair the other CFs?

If this cannot be repaired by scrub or upgradesstables you may need to cut the 
row out of the sstables using sstable2json and json2sstable.

 
Cheers
 
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 8/07/2012, at 4:05 PM, Michael Theroux wrote:

> Hello,
> 
> We're in the process of trying to move a 6-node cluster from RF=1 to RF=3. 
> Once our replication factor was upped to 3, we ran nodetool repair, and 
> immediately hit an issue on the first node we ran repair on:
> 
> INFO 03:08:51,536 Starting repair command #1, repairing 2 ranges.
> INFO 03:08:51,552 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] new session: will sync xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101, /10.29.187.61 on range (Token(bytes[d558]),Token(bytes[])] for x.[a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s]
> INFO 03:08:51,555 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] requesting merkle trees for a (to [/10.29.187.61, xxx-xx-xx-xxx-compute-1.amazonaws.com/10.202.99.101])
> INFO 03:08:52,719 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Received merkle tree for a from /10.29.187.61
> INFO 03:08:53,518 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Received merkle tree for a from xxx-xx-xx-xxx-.compute-1.amazonaws.com/10.202.99.101
> INFO 03:08:53,519 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] requesting merkle trees for b (to [/10.29.187.61, xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101])
> INFO 03:08:53,639 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Endpoints /10.29.187.61 and xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101 are consistent for a
> INFO 03:08:53,640 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] a is fully synced (18 remaining column family to sync for this session)
> INFO 03:08:54,049 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Received merkle tree for b from /10.29.187.61
> ERROR 03:09:09,440 Exception in thread Thread[ValidationExecutor:1,1,main]
> java.lang.AssertionError: row DecoratedKey(Token(bytes[efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47]), efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47) received out of order wrt DecoratedKey(Token(bytes[f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb]), f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb)
>     at org.apache.cassandra.service.AntiEntropyService$Validator.add(AntiEntropyService.java:349)
>     at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:712)
>     at org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:68)
>     at org.apache.cassandra.db.compaction.CompactionManager$8.call(CompactionManager.java:438)
>     at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>     at java.util.concurrent.FutureTask.run(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
> 
> From the log above, it looks like the sync of the "a" column family was 
> successful.  However, the "b" column family resulted in this error.  In 
> addition, the repair hung after this error.  We ran nodetool scrub on all 
> nodes and invalidated the key and row caches and tried again (with RF=2), but 
> it did not alleviate the problem.
> 
> Some other important pieces of information:
> We use ByteOrderedPartitioner (we MD5 hash the keys ourselves)
> We're using Leveled Compaction
> As we're in the middle of a transition, one node is on 1.1.2 (the one we 
> tried repair on), the other 5 are on 1.1.1
> 
> Thanks,
> -Mike
>
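
To make the AssertionError in the quoted log concrete: as I read the message, the validator saw row efd5... arrive after row f33a..., and with ByteOrderedPartitioner the row order is assumed here to be the plain unsigned byte order of the keys, under which efd5... sorts first. The Python sketch below illustrates that ordering check only. It is not Cassandra's code, and first_out_of_order is a hypothetical helper name.

```python
def first_out_of_order(keys_hex):
    """Return (index, previous_key, current_key) for the first key that
    sorts before its predecessor in unsigned byte order, or None if the
    sequence is already ordered.

    Assumption: byte-ordered partitioning compares raw key bytes
    lexicographically, which is how Python compares bytes objects.
    """
    previous = None
    for index, key_hex in enumerate(keys_hex):
        current = bytes.fromhex(key_hex)
        if previous is not None and current < previous:
            return index, previous.hex(), key_hex
        previous = current
    return None


# The two row keys from the quoted stack trace.
EFD5 = "efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47"
F33A = "f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb"

# The order the log suggests the rows were read in: f33a... then efd5...
violation = first_out_of_order([F33A, EFD5])
```

Run over the keys of a single sstable in file order (for example, keys listed from a JSON export), a non-None result would point at the row that may need to be cut out.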