Thanks Robert.  Does the switch from parallel to sequential explain why IO
increases? We see significantly higher IO with 2.0.10.

The nodetool docs [1] hint at the reason for defaulting to sequential:

"This allows the dynamic snitch to maintain performance for your
application via the other replicas, because at least one replica in the
snapshot is not undergoing repair."

Sean

[1] http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsRepair.html
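
For anyone comparing the two modes, here is a minimal sketch of the two
invocations discussed in this thread. It only prints the commands rather than
running them; the repair_cmd helper is a hypothetical name made up for
illustration, and the only flag used is the -par argument Robert mentions
below. Keyspace and host arguments are left out.

```shell
#!/bin/sh
# Build the nodetool repair command for a given mode without running it.
# "parallel" adds -par (the pre-2.0 behavior); anything else leaves the
# 2.0 default (sequential) in place.
# repair_cmd is a hypothetical helper, not part of nodetool.
repair_cmd() {
    mode="$1"
    if [ "$mode" = "parallel" ]; then
        echo "nodetool repair -par"
    else
        echo "nodetool repair"
    fi
}

repair_cmd sequential
repair_cmd parallel
```

Running the same repair once in each mode on a test node while watching disk
IO should show whether the mode change accounts for the difference.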


On Wed, Oct 15, 2014 at 5:36 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Wed, Oct 15, 2014 at 4:54 PM, Sean Bridges <sean.brid...@gmail.com>
> wrote:
>
>> We upgraded a cassandra cluster from 1.2.18 to 2.0.10, and it looks like
>> repair is significantly more expensive now.  Is this expected?
>>
>
> It depends on what you mean by "expected." Operators usually don't expect
> defaults with such dramatic impacts to change without them understanding
> why, but there is a reason for it.
>
> In 2.0 the default for repair was changed to be non-parallel. To get the
> old behavior, you need to supply -par as an argument.
>
> The context-free note you missed the significance of in NEWS.txt for
> version 2.0.2 says:
>
> - Nodetool defaults to Sequential mode for repair operations
>
> What this doesn't say is how almost certainly unreasonable this is as a
> default, because this means that repair is predictably slower in direct
> relationship to your replication factor, and the default for
> gc_grace_seconds (the time box in which one must complete a repair) did not
> change at the same time. The ticket where the change happens [1] does not
> specify a rationale, so your guess is as good as mine as to the reasoning
> which not only felt the change was necessary but reasonable.
>
> Leaving aside the problem you've encountered ("upgraders notice that their
> repairs (which already took forever) are suddenly WAY SLOWER") this default
> is also quite pathological for anyone operating with an RF over 3, which is
> a valid, if very uncommon, configuration.
>
> In summary, if, as an operator, you disagree that making repair slower by
> default as a factor of replication factor is reasonable, I suggest filing a
> JIRA and letting the project know. At least in that case there is a chance
> they might explain the rationale for so blithely making a change that has
> inevitable impact on operators... ?
>
> =Rob
> [1] https://issues.apache.org/jira/browse/CASSANDRA-5950
> http://twitter.com/rcolidba
>
