[jira] [Updated] (CASSANDRA-8366) Repair grows data on nodes, causes load to become unbalanced

Alan Boudreault (JIRA) Fri, 23 Jan 2015 07:43:02 -0800

     [ 
https://issues.apache.org/jira/browse/CASSANDRA-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alan Boudreault updated CASSANDRA-8366:
---------------------------------------
    Attachment: run2_no_compact_before_repair.log
                run1_with_compact_before_repair.log
                run3_no_compact_before_repair.log
                testv2.sh

[~krummas] I'm attaching a new version of the test script (testv2.sh). This 
one has some improvements and gives more details after each operation: it 
shows sstable sizes, properly waits for all compaction tasks to finish, 
displays the streaming status, flushes nodes, cleans nodes, etc.

I've run the script 3 times to see the differences.

* run1 is the only really successful result. The reason is that I compacted 
all nodes right after the cassandra-stress operation. Apparently, this removed 
the need to repair, so everything is fine and at the end of the script all 
nodes are at the proper size (1.43G).

* run2 doesn't compact after the stress. The repair is then run, and we do not 
see the "Did not get a positive answer" error until the end of the node2 
repair. So we can see that the keyspace r1 has been successfully repaired for 
node1 and node2. The repair for node3 failed, but it seems that the 2 other 
repairs took care of repairing things, so everything is OK at the end of the 
script (node size ~1.43G).

* run3 doesn't compact after the stress. This time, the repair fails at the 
beginning (node1 repair call). This makes the node2 and node3 repairs fail 
too. After flushing + cleaning + compacting, all nodes have an extra 1G of 
data, and I don't know what it is. There is no streaming, all compactions 
are done, and it looks like I cannot get rid of it. This is not in the log, 
but I restarted my cluster, retried a full repair of all nodes sequentially, 
then re-cleaned and re-compacted, and nothing changed. I let the cluster run 
all night long to be sure. I have not deleted this cluster, so if you need 
more information, I just have to restart it.

Do you see anything wrong in my tests? Ping me on IRC if you want to discuss 
this ticket further.
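
For anyone reproducing this without testv2.sh at hand, the "wait until all 
compaction tasks finish" step mentioned above could look roughly like the 
sketch below. It is not taken from the attached script. It assumes that 
`nodetool compactionstats` prints a "pending tasks: N" line and that N == 0 
can be treated as idle; the function names, the default host and the polling 
interval are hypothetical.

```python
import re
import subprocess
import time


def pending_compactions(stats_text):
    """Return the pending-task count found in compactionstats text, or None."""
    match = re.search(r"pending tasks:\s*(\d+)", stats_text)
    return int(match.group(1)) if match else None


def wait_for_compactions(read_stats, poll_seconds=5.0, max_polls=720):
    """Poll read_stats() until it reports zero pending tasks.

    read_stats is any zero-argument callable returning compactionstats text.
    Returns True once idle, False if max_polls is exhausted first.
    """
    for _ in range(max_polls):
        if pending_compactions(read_stats()) == 0:
            return True
        time.sleep(poll_seconds)
    return False


def nodetool_stats(host="127.0.0.1"):
    """Hypothetical helper: run nodetool against one node, return its stdout."""
    return subprocess.run(
        ["nodetool", "-h", host, "compactionstats"],
        capture_output=True, text=True, check=True,
    ).stdout
```

A call such as `wait_for_compactions(lambda: nodetool_stats("127.0.0.1"))` 
would then block until that node reports no pending tasks, or give up after 
max_polls attempts.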




> Repair grows data on nodes, causes load to become unbalanced
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-8366
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8366
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: 4 node cluster
> 2.1.2 Cassandra
> Inserts and reads are done with CQL driver
>            Reporter: Jan Karlsson
>            Assignee: Alan Boudreault
>         Attachments: results-17500000_inc_repair.txt, 
> results-5000000_1_inc_repairs.txt, results-5000000_2_inc_repairs.txt, 
> results-5000000_full_repair_then_inc_repairs.txt, 
> results-5000000_inc_repairs_not_parallel.txt, 
> run1_with_compact_before_repair.log, run2_no_compact_before_repair.log, 
> run3_no_compact_before_repair.log, test.sh, testv2.sh
>
>
> There seems to be something weird going on when repairing data.
> I have a program that runs for 2 hours, inserting 250 random numbers and 
> reading 250 times per second. It creates 2 keyspaces with SimpleStrategy and 
> an RF of 3. I use size-tiered compaction for my cluster. 
> After those 2 hours I run a repair and the load of all nodes goes up. If I 
> run incremental repair the load goes up a lot more. I saw the load shoot up 
> to 8 times the original size multiple times with incremental repair (from 2G 
> to 16G).
> With nodes 9, 8, 7 and 6, the repro procedure looked like this:
> (Note that running full repair first is not a requirement to reproduce.)
> {noformat}
> After 2 hours of 250 reads + 250 writes per second:
> UN  9  583.39 MB  256     ?       28220962-26ae-4eeb-8027-99f96e377406  rack1
> UN  8  584.01 MB  256     ?       f2de6ea1-de88-4056-8fde-42f9c476a090  rack1
> UN  7  583.72 MB  256     ?       2b6b5d66-13c8-43d8-855c-290c0f3c3a0b  rack1
> UN  6  583.84 MB  256     ?       b8bd67f1-a816-46ff-b4a4-136ad5af6d4b  rack1
> Repair -pr -par on all nodes sequentially
> UN  9  746.29 MB  256     ?       28220962-26ae-4eeb-8027-99f96e377406  rack1
> UN  8  751.02 MB  256     ?       f2de6ea1-de88-4056-8fde-42f9c476a090  rack1
> UN  7  748.89 MB  256     ?       2b6b5d66-13c8-43d8-855c-290c0f3c3a0b  rack1
> UN  6  758.34 MB  256     ?       b8bd67f1-a816-46ff-b4a4-136ad5af6d4b  rack1
> repair -inc -par on all nodes sequentially
> UN  9  2.41 GB    256     ?       28220962-26ae-4eeb-8027-99f96e377406  rack1
> UN  8  2.53 GB    256     ?       f2de6ea1-de88-4056-8fde-42f9c476a090  rack1
> UN  7  2.6 GB     256     ?       2b6b5d66-13c8-43d8-855c-290c0f3c3a0b  rack1
> UN  6  2.17 GB    256     ?       b8bd67f1-a816-46ff-b4a4-136ad5af6d4b  rack1
> after rolling restart
> UN  9  1.47 GB    256     ?       28220962-26ae-4eeb-8027-99f96e377406  rack1
> UN  8  1.5 GB     256     ?       f2de6ea1-de88-4056-8fde-42f9c476a090  rack1
> UN  7  2.46 GB    256     ?       2b6b5d66-13c8-43d8-855c-290c0f3c3a0b  rack1
> UN  6  1.19 GB    256     ?       b8bd67f1-a816-46ff-b4a4-136ad5af6d4b  rack1
> compact all nodes sequentially
> UN  9  989.99 MB  256     ?       28220962-26ae-4eeb-8027-99f96e377406  rack1
> UN  8  994.75 MB  256     ?       f2de6ea1-de88-4056-8fde-42f9c476a090  rack1
> UN  7  1.46 GB    256     ?       2b6b5d66-13c8-43d8-855c-290c0f3c3a0b  rack1
> UN  6  758.82 MB  256     ?       b8bd67f1-a816-46ff-b4a4-136ad5af6d4b  rack1
> repair -inc -par on all nodes sequentially
> UN  9  1.98 GB    256     ?       28220962-26ae-4eeb-8027-99f96e377406  rack1
> UN  8  2.3 GB     256     ?       f2de6ea1-de88-4056-8fde-42f9c476a090  rack1
> UN  7  3.71 GB    256     ?       2b6b5d66-13c8-43d8-855c-290c0f3c3a0b  rack1
> UN  6  1.68 GB    256     ?       b8bd67f1-a816-46ff-b4a4-136ad5af6d4b  rack1
> restart once more
> UN  9  2 GB       256     ?       28220962-26ae-4eeb-8027-99f96e377406  rack1
> UN  8  2.05 GB    256     ?       f2de6ea1-de88-4056-8fde-42f9c476a090  rack1
> UN  7  4.1 GB     256     ?       2b6b5d66-13c8-43d8-855c-290c0f3c3a0b  rack1
> UN  6  1.68 GB    256     ?       b8bd67f1-a816-46ff-b4a4-136ad5af6d4b  rack1
> {noformat}
> Is there something I'm missing or is this strange behavior?
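
To put numbers on the load growth in the quoted status output, a minimal 
sketch like the one below can compare two snapshots. It assumes that the load 
column uses 1024-based units and that each node line has the shape 
"UN <node> <value> <unit> ..." as in the listing above; the function names are 
made up for illustration.

```python
UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}


def load_bytes(value, unit):
    """Convert a load figure such as ("583.39", "MB") to bytes (1024-based)."""
    return float(value) * UNITS[unit]


def parse_status(lines):
    """Map node name -> load in bytes for 'UN <node> <value> <unit> ...' lines.

    A leading "> " mail-quote prefix is stripped; other lines are ignored.
    """
    loads = {}
    for line in lines:
        fields = line.lstrip("> ").split()
        if len(fields) >= 4 and fields[0] == "UN":
            loads[fields[1]] = load_bytes(fields[2], fields[3])
    return loads


def growth(before, after):
    """Return the after/before load ratio for nodes present in both snapshots."""
    return {node: after[node] / before[node] for node in before if node in after}


if __name__ == "__main__":
    before = parse_status(["> UN  7  583.72 MB  256  ?  2b6b5d66-...  rack1"])
    after = parse_status(["> UN  7  4.1 GB     256  ?  2b6b5d66-...  rack1"])
    for node, ratio in growth(before, after).items():
        print(f"node {node}: load grew {ratio:.2f}x")
```

With the first and last snapshots from the listing, node 7 comes out at 
roughly 7 times its original load.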



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
