Hi Jens,

I am reading Cassandra: The Definitive Guide. Chapter 9, "Reading and
Writing Data", has a section called "The Cassandra Write Path" that
contains this sentence:

If a replica does not respond within the timeout, it is presumed to be down
and a hint is stored for the write.

So your node might actually be fine, but it cannot cope with the load and
replies too late, after the coordinator has already received sufficient
replies from other replicas. The coordinator then stores a hint for that
write and that node. I am not sure how this relates to turning off hinted
handoff completely. If time allows, I can run some tests locally to
investigate various scenarios. There might be some subtle differences.
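
To make the scenario from the book concrete, here is a toy model in Python.
It is only a sketch of the behaviour described in that sentence, not
Cassandra's actual code: the Replica and Coordinator classes, the
response_ms values and the timeout are all made up for illustration. The
hinted_handoff_enabled flag is borrowed by name from cassandra.yaml, and the
model simply assumes that turning it off suppresses hint storage.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    response_ms: float  # hypothetical time the replica needs to acknowledge a write


@dataclass
class Coordinator:
    timeout_ms: float                    # stands in for the write request timeout
    hinted_handoff_enabled: bool = True  # assumption: False means no hints are stored
    hints: dict = field(default_factory=dict)  # replica name -> pending writes

    def write(self, mutation, replicas, required_acks):
        acked = [r.name for r in replicas if r.response_ms <= self.timeout_ms]
        late = [r.name for r in replicas if r.response_ms > self.timeout_ms]
        # A replica that misses the timeout is presumed down,
        # even if it is merely slow or overloaded.
        if self.hinted_handoff_enabled:
            for name in late:
                self.hints.setdefault(name, []).append(mutation)
        # The write succeeds as soon as enough replicas have acknowledged it.
        return len(acked) >= required_acks, late


coordinator = Coordinator(timeout_ms=2000)
replicas = [Replica("A1", 5), Replica("A2", 8), Replica("B1", 3500)]
ok, hinted = coordinator.write("row-1", replicas, required_acks=1)
print(ok, hinted, coordinator.hints)
```

In this model the write succeeds with a single acknowledgement, yet a hint
is still stored for B1 although B1 was never down, only slow.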

On Wed, 3 Apr 2019 at 07:19, Jens Fischer <j.fisc...@sonnen.de> wrote:

> Yes, Apache Cassandra 3.11.2 (no DSE).
>
> On 2. Apr 2019, at 19:40, sankalp kohli <kohlisank...@gmail.com> wrote:
>
> Are you using OSS C*?
>
> On Fri, Mar 29, 2019 at 1:49 AM Jens Fischer <j.fisc...@sonnen.de> wrote:
>
>> Hi,
>>
>> I have a Cassandra setup with multiple data centres. The vast majority of
>> writes are LOCAL_ONE writes to data center DC-A. One node (let's call this
>> node A1) in DC-A has accumulated a large volume of hint files (~100 GB). In
>> the logs of this node I see lots of messages like the following:
>>
>> INFO  [HintsDispatcher:26] 2019-03-28 01:49:25,217
>> HintsDispatchExecutor.java:289 - Finished hinted handoff of file
>> db485ac6-8acd-4241-9e21-7a2b540459de-1553419324363-1.hints to endpoint /
>> 10.10.2.55: db485ac6-8acd-4241-9e21-7a2b540459de
>>
>> The node 10.10.2.55 is in DC-B; let's call this node B1. There is no
>> indication whatsoever that B1 was down: nothing in our monitoring, nothing
>> in the logs of B1, nothing in the logs of A1. Are there any other
>> situations where hints for B1 are stored at A1, other than A1's failure
>> detector marking B1 as down? For example, could the reason for the
>> hints be that B1 is overloaded and cannot handle the intake from A1?
>> Or that the network connection between DC-A and DC-B is too slow?
>>
>> While researching this I also found the following information on Stack
>> Overflow from Ben Slater regarding hints and multi-dc replication:
>>
>> Another factor here is the consistency level you are using - a LOCAL_*
>> consistency level will only require writes to be written to the local DC
>> for the operation to be considered a success (and hints will be stored for
>> replication to the other DC).
>> (…)
>> The hints are the records of writes that have been made in one DC that
>> are not yet replicated to the other DC (or even nodes within a DC). I think
>> your options to avoid them are: (1) write with ALL or QUORUM (not LOCAL_*)
>> consistency - this will slow down your writes but will ensure writes go
>> into both DCs before the op completes (2) Don't replicate the data to the
>> second DC (by setting the replication factor to 0 for the second DC in the
>> keyspace definition) (3) Increase the capacity of the second DC so it can
>> keep up with the writes (4) Slow down your writes so the second DC can keep
>> up.
>>
>>
>> Source: https://stackoverflow.com/a/37382726
>>
>> This reads like hints are used for “normal” (async) replication between
>> data centres, i.e. hints could show up without any nodes being down
>> whatsoever. This could explain what I am seeing. Does anyone know more about
>> this? Does that mean I will see hints even if I disable hinted handoff?
>>
>> Any pointers or help are greatly appreciated!
>>
>> Thanks in advance
>> Jens
>>
>> Managing Directors: Christoph Ostermann (CEO), Oliver Koch, Steffen
>> Schneider, Hermann Schweizer.
>> Registered at Amtsgericht Kempten/Allgäu, registration number: 10655, tax
>> number 127/137/50792, VAT ID DE272208908
>>
>
