This is an automated email from the ASF dual-hosted git repository. jolynch pushed a commit to branch trunk in repository https://gitbox.apache.org/repos/asf/cassandra.git
The following commit(s) were added to refs/heads/trunk by this push: new 8402d1f Add documentation of hints 8402d1f is described below commit 8402d1f1456dc4da279f53dbd02f5ce7a1b2dffc Author: dvohra <dvohr...@yahoo.com> AuthorDate: Mon Jan 6 19:37:12 2020 -0800 Add documentation of hints Patch by Deepak Vohra; Reviewed by Joseph Lynch for CASSANDRA-15491 --- CHANGES.txt | 1 + doc/source/architecture/dynamo.rst | 2 + doc/source/operating/hints.rst | 259 +++++++++++++++++++++++++++++++++- doc/source/operating/images/hints.svg | 9 ++ doc/source/operating/metrics.rst | 4 + 5 files changed, 273 insertions(+), 2 deletions(-) diff --git a/CHANGES.txt b/CHANGES.txt index 5fe958c..f1da0b7 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,4 +1,5 @@ 4.0-alpha4 + * Add documentation of hints (CASSANDRA-15491) * updateCoordinatorWriteLatencyTableMetric can produce misleading metrics (CASSANDRA-15569) * Added documentation for read repair and an example of full repair (CASSANDRA-15485) * Make cqlsh and cqlshlib Python 2 & 3 compatible (CASSANDRA-10190) diff --git a/doc/source/architecture/dynamo.rst b/doc/source/architecture/dynamo.rst index 12c586e..380abc2 100644 --- a/doc/source/architecture/dynamo.rst +++ b/doc/source/architecture/dynamo.rst @@ -29,6 +29,8 @@ Failure Detection .. todo:: todo +.. _token-range: + Token Ring/Ranges ^^^^^^^^^^^^^^^^^ diff --git a/doc/source/operating/hints.rst b/doc/source/operating/hints.rst index f79f18a..94ff16f 100644 --- a/doc/source/operating/hints.rst +++ b/doc/source/operating/hints.rst @@ -17,6 +17,261 @@ .. highlight:: none Hints ------ +===== -.. todo:: todo +Hinting is a data repair technique applied during write operations. When +replica nodes are unavailable to accept a mutation, either due to failure or +more commonly routine maintenance, coordinators attempting to write to those +replicas store temporary hints on their local filesystem for later application +to the unavailable replica. Hints are an important way to help reduce the +duration of data inconsistency. Coordinators replay hints quickly after +unavailable replica nodes return to the ring. Hints are best effort, however, +and do not guarantee eventual consistency like :ref:`anti-entropy repair +<repair>` does. + +Hints are useful because of how Apache Cassandra replicates data to provide +fault tolerance, high availability and durability. Cassandra :ref:`partitions +data across the cluster <token-range>` using consistent hashing, and then +replicates keys to multiple nodes along the hash ring. To guarantee +availability, all replicas of a key can accept mutations without consensus, but +this means it is possible for some replicas to accept a mutation while others +do not. When this happens an inconsistency is introduced. + +Hints are one of the three ways, in addition to read-repair and +full/incremental anti-entropy repair, that Cassandra implements the eventual +consistency guarantee that all updates are eventually received by all replicas. +Hints, like read-repair, are best effort and not an alternative to performing +full repair, but they do help reduce the duration of inconsistency between +replicas in practice. + +Hinted Handoff +-------------- + +Hinted handoff is the process by which Cassandra applies hints to unavailable +nodes. + +For example, consider a mutation is to be made at ``Consistency Level`` +``LOCAL_QUORUM`` against a keyspace with ``Replication Factor`` of ``3``. +Normally the client sends the mutation to a single coordinator, who then sends +the mutation to all three replicas, and when two of the three replicas +acknowledge the mutation the coordinator responds successfully to the client. +If a replica node is unavailable, however, the coordinator stores a hint +locally to the filesystem for later application. New hints will be retained for +up to ``max_hint_window_in_ms`` of downtime (defaults to ``3 hours``). If the +unavailable replica does return to the cluster before the window expires, the +coordinator applies any pending hinted mutations against the replica to ensure +that eventual consistency is maintained. + +.. figure:: images/hints.svg + :alt: Hinted Handoff Example + + Hinted Handoff in Action + +* (``t0``): The write is sent by the client, and the coordinator sends it + to the three replicas. Unfortunately ``replica_2`` is restarting and cannot + receive the mutation. +* (``t1``): The client receives a quorum acknowledgement from the coordinator. + At this point the client believe the write to be durable and visible to reads + (which it is). +* (``t2``): After the write timeout (default ``2s``), the coordinator decides + that ``replica_2`` is unavailable and stores a hint to its local disk. +* (``t3``): Later, when ``replica_2`` starts back up it sends a gossip message + to all nodes, including the coordinator. +* (``t4``): The coordinator replays hints including the missed mutation + against ``replica_2``. + +If the node does not return in time, the destination replica will be +permanently out of sync until either read-repair or full/incremental +anti-entropy repair propagates the mutation. + +Application of Hints +^^^^^^^^^^^^^^^^^^^^ + +Hints are streamed in bulk, a segment at a time, to the target replica node and +the target node replays them locally. After the target node has replayed a +segment it deletes the segment and receives the next segment. This continues +until all hints are drained. + +Storage of Hints on Disk +^^^^^^^^^^^^^^^^^^^^^^^^ + +Hints are stored in flat files in the coordinator node’s +``$CASSANDRA_HOME/data/hints`` directory. A hint includes a hint id, the target +replica node on which the mutation is meant to be stored, the serialized +mutation (stored as a blob) that couldn't be delivered to the replica node, the +mutation timestamp, and the Cassandra version used to serialize the mutation. +By default hints are compressed using ``LZ4Compressor``. Multiple hints are +appended to the same hints file. + +Since hints contain the original unmodified mutation timestamp, hint application +is idempotent and cannot overwrite a future mutation. + +Hints for Timed Out Write Requests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Hints are also stored for write requests that time out. The +``write_request_timeout_in_ms`` setting in ``cassandra.yaml`` configures the +timeout for write requests. + +:: + + write_request_timeout_in_ms: 2000 + +The coordinator waits for the configured amount of time for write requests to +complete, at which point it will time out and generate a hint for the timed out +request. The lowest acceptable value for ``write_request_timeout_in_ms`` is 10 ms. + + +Configuring Hints +----------------- + +Hints are enabled by default as they are critical for data consistency. The +``cassandra.yaml`` configuration file provides several settings for configuring +hints: + +Table 1. Settings for Hints + ++--------------------------------------------+-------------------------------------------+-------------------------------+ +|Setting | Description |Default Value | ++--------------------------------------------+-------------------------------------------+-------------------------------+ +|``hinted_handoff_enabled`` |Enables/Disables hinted handoffs | ``true`` | +| | | | +| | | | +| | | | +| | | | ++--------------------------------------------+-------------------------------------------+-------------------------------+ +|``hinted_handoff_disabled_datacenters`` |A list of data centers that do not perform | ``unset`` | +| |hinted handoffs even when handoff is | | +| |otherwise enabled. | | +| |Example: | | +| | | | +| | .. code-block:: yaml | | +| | | | +| | hinted_handoff_disabled_datacenters: | | +| | - DC1 | | +| | - DC2 | | ++--------------------------------------------+-------------------------------------------+-------------------------------+ +|``max_hint_window_in_ms`` |Defines the maximum amount of time (ms) | ``10800000`` # 3 hours | +| |a node shall have hints generated after it | | +| |has failed. | | ++--------------------------------------------+-------------------------------------------+-------------------------------+ +|``hinted_handoff_throttle_in_kb`` |Maximum throttle in KBs per second, per | | +| |delivery thread. This will be reduced | ``1024`` | +| |proportionally to the number of nodes in | | +| |the cluster. | | +| |(If there are two nodes in the cluster, | | +| |each delivery thread will use the maximum | | +| |rate; if there are 3, each will throttle | | +| |to half of the maximum,since it is expected| | +| |for two nodes to be delivering hints | | +| |simultaneously.) | | ++--------------------------------------------+-------------------------------------------+-------------------------------+ +|``max_hints_delivery_threads`` |Number of threads with which to deliver | ``2`` | +| |hints; Consider increasing this number when| | +| |you have multi-dc deployments, since | | +| |cross-dc handoff tends to be slower | | ++--------------------------------------------+-------------------------------------------+-------------------------------+ +|``hints_directory`` |Directory where Cassandra stores hints. |``$CASSANDRA_HOME/data/hints`` | +| | | | ++--------------------------------------------+-------------------------------------------+-------------------------------+ +|``hints_flush_period_in_ms`` |How often hints should be flushed from the | ``10000`` | +| |internal buffers to disk. Will *not* | | +| |trigger fsync. | | ++--------------------------------------------+-------------------------------------------+-------------------------------+ +|``max_hints_file_size_in_mb`` |Maximum size for a single hints file, in | ``128`` | +| |megabytes. | | ++--------------------------------------------+-------------------------------------------+-------------------------------+ +|``hints_compression`` |Compression to apply to the hint files. | ``LZ4Compressor`` | +| |If omitted, hints files will be written | | +| |uncompressed. LZ4, Snappy, and Deflate | | +| |compressors are supported. | | ++--------------------------------------------+-------------------------------------------+-------------------------------+ + +Configuring Hints at Runtime with ``nodetool`` +---------------------------------------------- + +``nodetool`` provides several commands for configuring hints or getting hints +related information. The nodetool commands override the corresponding +settings if any in ``cassandra.yaml`` for the node running the command. + +Table 2. Nodetool Commands for Hints + ++--------------------------------+-------------------------------------------+ +|Command | Description | ++--------------------------------+-------------------------------------------+ +|``nodetool disablehandoff`` |Disables storing and delivering hints | ++--------------------------------+-------------------------------------------+ +|``nodetool disablehintsfordc`` |Disables storing and delivering hints to a | +| |data center | ++--------------------------------+-------------------------------------------+ +|``nodetool enablehandoff`` |Re-enables future hints storing and | +| |delivery on the current node | ++--------------------------------+-------------------------------------------+ +|``nodetool enablehintsfordc`` |Enables hints for a data center that was | +| |previously disabled | ++--------------------------------+-------------------------------------------+ +|``nodetool getmaxhintwindow`` |Prints the max hint window in ms. New in | +| |Cassandra 4.0. | ++--------------------------------+-------------------------------------------+ +|``nodetool handoffwindow`` |Prints current hinted handoff window | ++--------------------------------+-------------------------------------------+ +|``nodetool pausehandoff`` |Pauses hints delivery process | ++--------------------------------+-------------------------------------------+ +|``nodetool resumehandoff`` |Resumes hints delivery process | ++--------------------------------+-------------------------------------------+ +|``nodetool |Sets hinted handoff throttle in kb | +|sethintedhandoffthrottlekb`` |per second, per delivery thread | ++--------------------------------+-------------------------------------------+ +|``nodetool setmaxhintwindow`` |Sets the specified max hint window in ms | ++--------------------------------+-------------------------------------------+ +|``nodetool statushandoff`` |Status of storing future hints on the | +| |current node | ++--------------------------------+-------------------------------------------+ +|``nodetool truncatehints`` |Truncates all hints on the local node, or | +| |truncates hints for the endpoint(s) | +| |specified. | ++--------------------------------+-------------------------------------------+ + +Make Hints Play Faster at Runtime +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The default of ``1024 kbps`` handoff throttle is conservative for most modern +networks, and it is entirely possible that in a simple node restart you may +accumulate many gigabytes hints that may take hours to play back. For example if +you are ingesting ``100 Mbps`` of data per node, a single 10 minute long +restart will create ``10 minutes * (100 megabit / second) ~= 7 GiB`` of data +which at ``(1024 KiB / second)`` would take ``7.5 GiB / (1024 KiB / second) = +2.03 hours`` to play back. The exact math depends on the load balancing strategy +(round robin is better than token aware), number of tokens per node (more +tokens is better than fewer), and naturally the cluster's write rate, but +regardless you may find yourself wanting to increase this throttle at runtime. + +If you find yourself in such a situation, you may consider raising +the ``hinted_handoff_throttle`` dynamically via the +``nodetool sethintedhandoffthrottlekb`` command. + +Allow a Node to be Down Longer at Runtime +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Sometimes a node may be down for more than the normal ``max_hint_window_in_ms``, +(default of three hours), but the hardware and data itself will still be +accessible. In such a case you may consider raising the +``max_hint_window_in_ms`` dynamically via the ``nodetool setmaxhintwindow`` +command added in Cassandra 4.0 (`CASSANDRA-11720 <https://issues.apache.org/jira/browse/CASSANDRA-11720>`_). +This will instruct Cassandra to continue holding hints for the down +endpoint for a longer amount of time. + +This command should be applied on all nodes in the cluster that may be holding +hints. If needed, the setting can be applied permanently by setting the +``max_hint_window_in_ms`` setting in ``cassandra.yaml`` followed by a rolling +restart. + +Monitoring Hint Delivery +------------------------ + +Cassandra 4.0 adds histograms available to understand how long it takes to deliver +hints which is useful for operators to better identify problems (`CASSANDRA-13234 +<https://issues.apache.org/jira/browse/CASSANDRA-13234>`_). + +There are also metrics available for tracking :ref:`Hinted Handoff <handoff-metrics>` +and :ref:`Hints Service <hintsservice-metrics>` metrics. diff --git a/doc/source/operating/images/hints.svg b/doc/source/operating/images/hints.svg new file mode 100644 index 0000000..5e952e7 --- /dev/null +++ b/doc/source/operating/images/hints.svg @@ -0,0 +1,9 @@ +<svg xmlns="http://www.w3.org/2000/svg" width="661.2000122070312" height="422.26666259765625" style=" + width:661.2000122070312px; + height:422.26666259765625px; + background: transparent; + fill: none; +"> + <svg xmlns="http://www.w3.org/2000/svg" class="role-diagram-draw-area"><g class="shapes-region" style="stroke: black; fill: none;"><g class="composite-shape"><path class="real" d=" M40,60 C40,43.43 53.43,30 70,30 C86.57,30 100,43.43 100,60 C100,76.57 86.57,90 70,90 C53.43,90 40,76.57 40,60 Z" style="stroke-width: 1px; stroke: rgb(0, 0, 0); fill: none;"/></g><g class="arrow-line"><path class="connection real" stroke-dasharray="" d=" M70,300 L70,387" style="stroke: rgb(0, 0, 0); s [...] + <svg xmlns="http://www.w3.org/2000/svg" width="660" height="421.066650390625" style="width:660px;height:421.066650390625px;font-family:Asana-Math, Asana;background:transparent;"><g><g><g style="transform:matrix(1,0,0,1,47.266693115234375,65.81666564941406);"><path d="M342 330L365 330C373 395 380 432 389 458C365 473 330 482 293 482C248 483 175 463 118 400C64 352 25 241 25 136C25 40 67 -11 147 -11C201 -11 249 9 304 54L354 95L346 115L331 105C259 57 221 40 186 40C130 40 101 80 101 15 [...] +</svg> diff --git a/doc/source/operating/metrics.rst b/doc/source/operating/metrics.rst index e87bd5a..fc37440 100644 --- a/doc/source/operating/metrics.rst +++ b/doc/source/operating/metrics.rst @@ -534,6 +534,8 @@ TotalHints Counter Number of hint messages written to thi TotalHintsInProgress Counter Number of hints attemping to be sent currently. ========================== ============== =========== +.. _handoff-metrics: + HintedHandoff Metrics ^^^^^^^^^^^^^^^^^^^^^ @@ -556,6 +558,8 @@ Hints_created-<PeerIP> Counter Number of hints on disk for this pee Hints_not_stored-<PeerIP> Counter Number of hints not stored for this peer, due to being down past the configured hint window. =========================== ============== =========== +.. _hintsservice-metrics: + HintsService Metrics ^^^^^^^^^^^^^^^^^^^^^ --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org