[ https://issues.apache.org/jira/browse/CASSANDRA-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855774#comment-13855774 ]

Donald Smith edited comment on CASSANDRA-5220 at 12/23/13 8:49 PM:
-------------------------------------------------------------------

 We ran "nodetool repair" on a 3 node cassandra cluster with production-quality 
hardware, using version 2.0.3. Each node had about 1TB of data. This is still 
testing.  After 5 days the repair job still hasn't finished. I can see it's 
still running.

Here's the process:
{noformat}
root     30835 30774  0 Dec17 pts/0    00:03:53 /usr/bin/java -cp 
/etc/cassandra/conf:/usr/share/java/jna.jar:/usr/share/cassandra/lib/antlr-3.2.jar:/usr/share/cassandra/lib/apache-cassandra-2.0.3.jar:/usr/share/cassandra/lib/apache-cassandra-clientutil-2.0.3.jar:/usr/share/cassandra/lib/apache-cassandra-thrift-2.0.3.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-lang3-3.1.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.3.jar:/usr/share/cassandra/lib/disruptor-3.0.1.jar:/usr/share/cassandra/lib/guava-15.0.jar:/usr/share/cassandra/lib/high-scale-lib-1.1.2.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/share/cassandra/lib/jamm-0.2.5.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jline-1.0.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/libthrift-0.9.1.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/lz4-1.2.0.jar:/usr/share/cassandra/lib/metrics-core-2.2.0.jar:/usr/share/cassandra/lib/netty-3.6.6.Final.jar:/usr/share/cassandra/lib/reporter-config-2.1.0.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.7.2.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.7.2.jar:/usr/share/cassandra/lib/snakeyaml-1.11.jar:/usr/share/cassandra/lib/snappy-java-1.0.5.jar:/usr/share/cassandra/lib/snaptree-0.1.jar:/usr/share/cassandra/lib/stress.jar:/usr/share/cassandra/lib/thrift-server-0.3.2.jar
 -Xmx32m -Dlog4j.configuration=log4j-tools.properties -Dstorage-config=/etc/cassandra/conf org.apache.cassandra.tools.NodeCmd -p 7199 repair -pr as_reports
{noformat}
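In case it helps anyone else watching a long repair: the only progress indicators I know of are standard nodetool commands (a rough sketch; the merkle-tree builds that repair kicks off only show up while they're running):
{noformat}
# merkle-tree builds triggered by repair appear as "Validation" rows
nodetool compactionstats

# repair streaming between replicas shows up here
nodetool netstats
{noformat}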

The log output contains just:
{noformat}
xss =  -ea -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8192M -Xmx8192M -Xmn2048M -XX:+HeapDumpOnOutOfMemoryError -Xss256k
[2013-12-17 23:26:48,144] Starting repair command #1, repairing 256 ranges for keyspace as_reports
{noformat}

Here's the output of "nodetool tpstats":
{noformat}
cass3 /tmp> nodetool tpstats
xss =  -ea -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8192M -Xmx8192M -Xmn2048M -XX:+HeapDumpOnOutOfMemoryError -Xss256k
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         1         0       38083403         0                 0
RequestResponseStage              0         0     1951200451         0                 0
MutationStage                     0         0     2853354069         0                 0
ReadRepairStage                   0         0        3794926         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0        4880147         0                 0
AntiEntropyStage                  1         3              9         0                 0
MigrationStage                    0         0             30         0                 0
MemoryMeter                       0         0            115         0                 0
MemtablePostFlusher               0         0          75121         0                 0
FlushWriter                       0         0          49934         0                52
MiscStage                         0         0              0         0                 0
PendingRangeCalculator            0         0              7         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               1         1              1         0                 0
InternalResponseStage             0         0              9         0                 0
HintedHandoff                     0         0           1141         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
PAGED_RANGE                  0
BINARY                       0
READ                       884
MUTATION               1407711
_TRACE                       0
REQUEST_RESPONSE             0
{noformat}
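AntiEntropySessions shows 1 active and 1 pending, and AntiEntropyStage's Completed count is sitting at 9. A crude way to watch whether it ever advances (plain watch + grep, nothing Cassandra-specific):
{noformat}
watch -n 60 'nodetool tpstats | grep -i antientropy'
{noformat}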
The cluster has some write traffic; we deliberately decided to test repair under load.
This is the busiest column family, as reported by "nodetool cfstats":
{noformat}
   Read Count: 38084316
        Read Latency: 9.409910464927346 ms.
        Write Count: 2850436738
        Write Latency: 0.8083138546641199 ms.
        Pending Tasks: 0
....
    Table: data_report_details
                SSTable count: 592
                Space used (live), bytes: 160644106183
                Space used (total), bytes: 160663248847
                SSTable Compression Ratio: 0.5296494510512617
                Number of keys (estimate): 51015040
                Memtable cell count: 311180
                Memtable data size, bytes: 46275953
                Memtable switch count: 6100
                Local read count: 6147
                Local read latency: 154.539 ms
                Local write count: 750865416
                Local write latency: 0.029 ms
                Pending tasks: 0
                Bloom filter false positives: 265
                Bloom filter false ratio: 0.06009
                Bloom filter space used, bytes: 64690104
                Compacted partition minimum bytes: 30
                Compacted partition maximum bytes: 10090808
                Compacted partition mean bytes: 5267
                Average live cells per slice (last five minutes): 1.0
                Average tombstones per slice (last five minutes): 0.0
{noformat}
We're going to restart the node. We rarely do deletes or updates (only when a 
report is re-uploaded), so we suspect we can get by without running repairs. 
Correct us if we're wrong about that.

"nodetool compactionstats" outputs:
{noformat}
xss =  -ea -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8192M -Xmx8192M -Xmn2048M -XX:+HeapDumpOnOutOfMemoryError -Xss256k
pending tasks: 166
          compaction type      keyspace                                  table       completed            total      unit   progress
               Compaction    as_reports    data_report_details_below_threshold       971187148       1899419306     bytes     51.13%
               Compaction    as_reports    data_report_details_below_threshold       950086203       1941500979     bytes     48.94%
               Compaction    as_reports                 data_hierarchy_details      2968934609       5808990354     bytes     51.11%
               Compaction    as_reports    data_report_details_below_threshold       945816183       1900166474     bytes     49.78%
               Compaction    as_reports    data_report_details_below_threshold       899143344       1943534395     bytes     46.26%
               Compaction    as_reports    data_report_details_below_threshold       856329840       1946566670     bytes     43.99%
               Compaction    as_reports                    data_report_details       195235688        915395763     bytes     21.33%
               Compaction    as_reports    data_report_details_below_threshold       982460217       1931001761     bytes     50.88%
               Compaction    as_reports    data_report_details_below_threshold       896609409       1931075688     bytes     46.43%
               Compaction    as_reports    data_report_details_below_threshold       869219044       1928977382     bytes     45.06%
               Compaction    as_reports    data_report_details_below_threshold       870931112       1901729646     bytes     45.80%
               Compaction    as_reports    data_report_details_below_threshold       879343635       1939491280     bytes     45.34%
               Compaction    as_reports    data_report_details_below_threshold       981888944       1893024439     bytes     51.87%
               Compaction    as_reports    data_report_details_below_threshold       871785587       1884652607     bytes     46.26%
               Compaction    as_reports    data_report_details_below_threshold       902340327       1913280943     bytes     47.16%
               Compaction    as_reports    data_report_details_below_threshold      1025069846       1901568674     bytes     53.91%
               Compaction    as_reports    data_report_details_below_threshold       920112020       1893272832     bytes     48.60%
               Compaction    as_reports                 data_hierarchy_details      2962138268       5774762866     bytes     51.29%
               Compaction    as_reports    data_report_details_below_threshold       790782860       1918640911     bytes     41.22%
               Compaction    as_reports                 data_hierarchy_details      2972501409       5885217724     bytes     50.51%
               Compaction    as_reports    data_report_details_below_threshold      1611697659       1939040337     bytes     83.12%
               Compaction    as_reports    data_report_details_below_threshold       943130526       1943713837     bytes     48.52%
               Compaction    as_reports    data_report_details_below_threshold       911127302       1952885196     bytes     46.66%
               Compaction    as_reports    data_report_details_below_threshold       911230087       1927967871     bytes     47.26%
{noformat}
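Those 166 pending compactions are competing with the repair for disk I/O. If compaction throughput is the bottleneck, it can be raised at runtime; as far as I know the default cap is compaction_throughput_mb_per_sec = 16 in cassandra.yaml, and 0 means unthrottled:
{noformat}
nodetool setcompactionthroughput 0    # 0 = unthrottled; 16 restores the default
{noformat}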

Now "nodetool tpstats" says:
{noformat}
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
AntiEntropyStage                  1         3              9         0                 0
{noformat}

We ran "nodetool repair -pr" on 10.1.40.43. Here are the log references to it. 
Note the roughly 28-hour gap between the first merkle-tree exchange on 12-17 and 
the next one on 12-19; maybe the repair job hung.
{noformat}
cass3 /var/log/cassandra> grep -i repair system.log.? | grep -i merkle
system.log.1: INFO [AntiEntropySessions:1] 2013-12-17 23:26:48,459 RepairJob.java (line 116) [repair #c1540f60-67b5-11e3-b8b7-fb178cd88033] requesting merkle trees for data_report_details_by_uus (to [/10.1.40.42, dc1-cassandra-staging-03.dc01.revsci.net/10.1.40.43])
system.log.1: INFO [AntiEntropyStage:1] 2013-12-17 23:26:48,807 RepairSession.java (line 157) [repair #c1540f60-67b5-11e3-b8b7-fb178cd88033] Received merkle tree for data_report_details_by_uus from /10.1.40.42
system.log.1: INFO [AntiEntropyStage:1] 2013-12-17 23:26:49,091 RepairSession.java (line 157) [repair #c1540f60-67b5-11e3-b8b7-fb178cd88033] Received merkle tree for data_report_details_by_uus from /10.1.40.43
system.log.1: INFO [AntiEntropyStage:1] 2013-12-19 03:58:31,007 RepairJob.java (line 116) [repair #c1540f60-67b5-11e3-b8b7-fb178cd88033] requesting merkle trees for data_hierarchy_details (to [/10.1.40.42, dc1-cassandra-staging-03.dc01.revsci.net/10.1.40.43])
system.log.1: INFO [AntiEntropySessions:5] 2013-12-19 03:58:31,012 RepairJob.java (line 116) [repair #e0ff9ba0-68a4-11e3-b8b7-fb178cd88033] requesting merkle trees for data_report_details_by_uus (to [/10.1.40.41, dc1-cassandra-staging-03.dc01.revsci.net/10.1.40.43])
system.log.1: INFO [AntiEntropyStage:1] 2013-12-19 03:58:31,316 RepairSession.java (line 157) [repair #e0ff9ba0-68a4-11e3-b8b7-fb178cd88033] Received merkle tree for data_report_details_by_uus from /10.1.40.41
system.log.1: INFO [AntiEntropyStage:1] 2013-12-19 03:58:31,431 RepairSession.java (line 157) [repair #e0ff9ba0-68a4-11e3-b8b7-fb178cd88033] Received merkle tree for data_report_details_by_uus from /10.1.40.43
{noformat}
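If I'm reading RepairSession.java right, a session that finishes logs "session completed successfully", so a quick check for finished sessions (assuming that message; adjust if 2.0.3 words it differently):
{noformat}
grep -i "session completed" /var/log/cassandra/system.log*
{noformat}
If repair sessions were actually finishing, lines like that should show up.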


> Repair improvements when using vnodes
> -------------------------------------
>
>                 Key: CASSANDRA-5220
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5220
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.2.0 beta 1
>            Reporter: Brandon Williams
>            Assignee: Yuki Morishita
>             Fix For: 2.1
>
>
> Currently when using vnodes, repair takes much longer to complete than 
> without them.  This appears at least in part because it's using a session per 
> range and processing them sequentially.  This generates a lot of log spam 
> with vnodes, and while being gentler and lighter on hard disk deployments, 
> ssd-based deployments would often prefer that repair be as fast as possible.


