Hi Thomas, I don't think I have ever seen compaction run faster than that.
For me, tables with small values usually compact at around 5 MB/s for a single compaction. With larger blobs (a few KB per blob) I have seen 16 MB/s. Both with "nodetool setcompactionthroughput 0". I don't think it's disk-related either. I suspect parsing the data simply saturates the CPU, or perhaps the issue is GC-related. But I have never dug into it; I just observed low IO-wait percentages in top.

regards,
Christian

On Thu, Apr 26, 2018 at 7:39 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
> I can't say for sure, because I haven't measured it, but I've seen a
> combination of readahead + large chunk size with compression cause serious
> issues with read amplification, although I'm not sure if or how it would
> apply here. It likely depends on the size of your partitions and the
> fragmentation of the sstables, although at only 5 GB I'm really surprised
> to hear of 32 GB read in; that seems a bit absurd.
>
> Definitely something to dig deeper into.
>
> On Thu, Apr 26, 2018 at 5:02 AM Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
>> Hello,
>>
>> yet another question/issue with repair.
>>
>> Cassandra 2.1.18, 3 nodes, RF=3, vnodes=256, data volume ~5 GB per node
>> only. A repair (nodetool repair -par) issued on a single node at this
>> data volume takes around 36 min with an average of ~15 MB/s disk
>> throughput (read+write) for the entire time frame, thus processing
>> ~32 GB from a disk perspective, so ~6 times the real data volume
>> reported by nodetool status. Does this make any sense? This is with 4
>> compaction threads and compaction throughput = 64. Similar results when
>> repeating this test a few times, where most/all inconsistent data should
>> already have been sorted out by previous runs.
>>
>> I know there is e.g. Reaper, but the above is a simple use case, arising
>> after a single failed node recovers beyond the 3h hinted handoff window.
>> How should this finish in a timely manner for > 500 GB on a recovering
>> node?
>>
>> I have to admit this is with NFS as storage. I know NFS might not be
>> the best idea, but with the above test at ~5 GB data volume we see an
>> IOPS rate of ~700 at a disk latency of ~15 ms, so I wouldn't call it
>> that bad. This is all running Cassandra on-premise (at the customer, so
>> not hosted by us), so while we can make recommendations storage-wise
>> (of course preferring local disks), it may and will happen that NFS is
>> in use.
>>
>> Why we are using -par in combination with NFS is a different story,
>> related to this issue:
>> https://issues.apache.org/jira/browse/CASSANDRA-8743. Without switching
>> from sequential to parallel repair, we basically kill Cassandra.
>>
>> Throughput-wise, I also don't think it is related to NFS, because we see
>> similar repair throughput values with AWS EBS (gp2, SSD-based) when
>> running regular repairs on small-sized CFs.
>>
>> Thanks for any input.
>> Thomas
>>
>> The contents of this e-mail are intended for the named addressee only.
>> It contains information that may be confidential. Unless you are the
>> named addressee or an authorized designee, you may not copy or use it,
>> or disclose it to anyone else. If you received it in error please notify
>> us immediately and then destroy it. Dynatrace Austria GmbH (registration
>> number FN 91482h) is a company registered in Linz whose registered
>> office is at 4040 Linz, Austria, Freistädterstraße 313
>
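[Editor's aside: Thomas's figures are internally consistent, which is easy to check. The sketch below uses only the numbers quoted in the thread (36 min, ~15 MB/s, ~5 GB live data); the extrapolation to 500 GB assumes repair time scales linearly with data volume, which is an assumption, not something measured in the thread.]

```python
# Back-of-the-envelope check of the repair numbers from the thread.
duration_s = 36 * 60        # repair took ~36 minutes
throughput_mb_s = 15        # ~15 MB/s combined read+write disk throughput
live_data_gb = 5            # ~5 GB live data per node (nodetool status)

total_gb = duration_s * throughput_mb_s / 1024   # MB -> GB
amplification = total_gb / live_data_gb

print(f"total disk traffic: ~{total_gb:.1f} GB")        # ~31.6 GB
print(f"amplification vs live data: ~{amplification:.1f}x")  # ~6.3x

# Naive linear extrapolation to a 500 GB node (assumption, see above):
est_hours_500gb = (36 / 60) * (500 / live_data_gb)
print(f"estimated repair time at 500 GB: ~{est_hours_500gb:.0f} h")  # ~60 h
```

So the "~32 GB / ~6x" in the original mail checks out, and the linear estimate illustrates why Thomas is worried about the > 500 GB recovery case.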