Hi Jens,

I am reading "Cassandra: The Definitive Guide". Chapter 9, "Reading and Writing Data", has a section called "The Cassandra Write Path", which contains this sentence:
"If a replica does not respond within the timeout, it is presumed to be down and a hint is stored for the write."

So your node might actually be fine; it just cannot cope with the load and replies too late, after the coordinator already has sufficient replies from other replicas. The coordinator therefore stores a hint for that write and for that node.

I am not sure how this relates to turning off hinted handoff completely. If time allows, I can run some tests locally to investigate various scenarios. There might be some subtle differences ...

On Wed, 3 Apr 2019 at 07:19, Jens Fischer <j.fisc...@sonnen.de> wrote:

> Yes, Apache Cassandra 3.11.2 (no DSE).
>
> On 2. Apr 2019, at 19:40, sankalp kohli <kohlisank...@gmail.com> wrote:
>
> Are you using OSS C*?
>
> On Fri, Mar 29, 2019 at 1:49 AM Jens Fischer <j.fisc...@sonnen.de> wrote:
>
>> Hi,
>>
>> I have a Cassandra setup with multiple data centres. The vast majority of
>> writes are LOCAL_ONE writes to data center DC-A. One node (let's call this
>> node A1) in DC-A has accumulated large amounts of hint files (~100 GB). In
>> the logs of this node I see lots of messages like the following:
>>
>> INFO [HintsDispatcher:26] 2019-03-28 01:49:25,217
>> HintsDispatchExecutor.java:289 - Finished hinted handoff of file
>> db485ac6-8acd-4241-9e21-7a2b540459de-1553419324363-1.hints to endpoint /
>> 10.10.2.55: db485ac6-8acd-4241-9e21-7a2b540459de
>>
>> The node 10.10.2.55 is in DC-B, let's call this node B1. There is no
>> indication whatsoever that B1 was down: Nothing in our monitoring, nothing
>> in the logs of B1, nothing in the logs of A1. Are there any other
>> situations where hints to B1 are stored at A1? Other than A1's failure
>> detection detecting B1 as down I mean. For example could the reason for the
>> hints be that B1 is overloaded and can not handle the intake from the A1?
>> Or that the network connection between DC-A and DC-B is too slow?
>>
>> While researching this I also found the following information on Stack
>> Overflow from Ben Slater regarding hints and multi-dc replication:
>>
>> Another factor here is the consistency level you are using - a LOCAL_*
>> consistency level will only require writes to be written to the local DC
>> for the operation to be considered a success (and hints will be stored for
>> replication to the other DC).
>> (…)
>> The hints are the records of writes that have been made in one DC that
>> are not yet replicated to the other DC (or even nodes within a DC). I think
>> your options to avoid them are: (1) write with ALL or QUORUM (not LOCAL_*)
>> consistency - this will slow down your writes but will ensure writes go
>> into both DCs before the op completes (2) Don't replicate the data to the
>> second DC (by setting the replication factor to 0 for the second DC in the
>> keyspace definition) (3) Increase the capacity of the second DC so it can
>> keep up with the writes (4) Slow down your writes so the second DC can keep
>> up.
>>
>>
>> Source: https://stackoverflow.com/a/37382726
>>
>> This reads like hints are used for “normal” (async) replication between
>> data centres, i.e. hints could show up without any nodes being down
>> whatsoever. This could explain what I am seeing. Does anyone know more about
>> this? Does that mean I will see hints even if I disable hinted handoff?
>>
>> Any pointers or help are greatly appreciated!
>>
>> Thanks in advance
>> Jens
>>
>> Geschäftsführer: Christoph Ostermann (CEO), Oliver Koch, Steffen
>> Schneider, Hermann Schweizer.
>> Amtsgericht Kempten/Allgäu, Registernummer: 10655, Steuernummer
>> 127/137/50792, USt.-IdNr. DE272208908
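
To make the timeout scenario from the top of this message concrete, here is a toy model in Python. It is only a sketch and not Cassandra's actual code: the names Replica and Coordinator, the simulated latencies, and the local_dc and required_acks parameters are all made up for illustration. The 2000 ms timeout is assumed to mirror the default write_request_timeout_in_ms, and the hinted_handoff_enabled flag models the assumption (not verified here against 3.11.2) that no new hints are stored once hinted handoff is disabled.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    dc: str
    latency_ms: int  # simulated time this replica needs to acknowledge a write


@dataclass
class Coordinator:
    # Assumption: mirrors the default write_request_timeout_in_ms.
    write_timeout_ms: int = 2000
    # Assumption: with this off, the coordinator stores no new hints.
    hinted_handoff_enabled: bool = True
    # Hypothetical hint store: replica name -> list of pending mutations.
    hints: dict = field(default_factory=dict)

    def write(self, mutation, replicas, local_dc, required_acks):
        """Send the mutation to all replicas; hint the ones that answer too late."""
        on_time = [r for r in replicas if r.latency_ms <= self.write_timeout_ms]
        late = [r for r in replicas if r.latency_ms > self.write_timeout_ms]

        # A late replica is alive, just slow. The coordinator cannot tell the
        # difference within the timeout, so it records a hint for it.
        if self.hinted_handoff_enabled:
            for r in late:
                self.hints.setdefault(r.name, []).append(mutation)

        # For a LOCAL_* style write only acks from the local DC count.
        local_acks = sum(1 for r in on_time if r.dc == local_dc)
        return local_acks >= required_acks


# A1 answers quickly, B1 is healthy but overloaded (or behind a slow link).
replicas = [Replica("A1", "DC-A", 5), Replica("B1", "DC-B", 2500)]

coordinator = Coordinator()
ok = coordinator.write("write-1", replicas, local_dc="DC-A", required_acks=1)
print(ok, coordinator.hints)  # True {'B1': ['write-1']}
```

In this model the write succeeds for the client and B1 still gets a hint although it was never down, which is the situation described for A1 and B1 in the quoted message. Constructing the coordinator with hinted_handoff_enabled=False leaves the hint store empty under the stated assumption.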