Antti Nissinen created CASSANDRA-9644:
-----------------------------------------

             Summary: DTCS configuration proposals for handling consequences of 
repairs
                 Key: CASSANDRA-9644
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9644
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
            Reporter: Antti Nissinen
             Fix For: 3.x, 2.1.x
         Attachments: node0_20150621_1646_time_graph.txt, 
node0_20150621_2320_time_graph.txt, node0_20150623_1526_time_graph.txt, 
node1_20150621_1646_time_graph.txt, node1_20150621_2320_time_graph.txt, 
node1_20150623_1526_time_graph.txt, node2_20150621_1646_time_graph.txt, 
node2_20150621_2320_time_graph.txt, node2_20150623_1526_time_graph.txt, 
nodetool status infos.txt, sstable_compaction_trace.txt, 
sstable_compaction_trace_snipped.txt, sstable_counts.jpg

This document brings up some issues encountered when DTCS is used to compact
time series data in a three node cluster. DTCS is currently configured with a
few parameters that keep the configuration fairly simple, but this can cause
problems in certain special cases, such as recovering from the flood of small
SSTables created by a repair operation. We are suggesting some ideas that might
be a starting point for further discussion. The following sections contain:

- Description of the cassandra setup
- Feeding process of the data
- Failure testing
- Issues that repair operations cause for DTCS
- Proposal for the DTCS configuration parameters

Attachments are included to support the discussion, and a separate section at
the end of this document explains them.

Cassandra setup and data model

- The cluster is composed of three nodes running Cassandra 2.1.2. The
replication factor is two and both read and write consistency levels are ONE.
- Data is time series data. It is saved so that one row contains a certain
time span of data for a given metric (20 days in this case). The row key
contains the start time of the time span and the metric name. The column name
gives the offset from the beginning of the time span. The column timestamp is
set to the actual timestamp of the data point, i.e. the timestamp from the row
key plus the offset. The data model is analogous to the KairosDB
implementation (see the CQL sketch after this list).
- The average sampling rate is 10 seconds, varying significantly from metric
to metric.
- 100,000 metrics are fed into Cassandra.
- max_sstable_age_days is set to 5 days (the objective is to keep SSTable
files at a manageable size, around 50 GB).
- TTL is not used in the test.
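
As a point of reference, a minimal CQL sketch of this kind of layout, assuming
a KairosDB-like schema (the table and column names here are hypothetical, not
the actual schema used in the test):

  CREATE TABLE data_points (
      metric text,         -- metric name (part of the row key)
      row_time timestamp,  -- start of the 20-day time span (part of the row key)
      offset int,          -- offset from the beginning of the time span
      value blob,
      PRIMARY KEY ((metric, row_time), offset)
  ) WITH compaction = {
      'class': 'DateTieredCompactionStrategy',
      'max_sstable_age_days': '5'
  };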

Procedure for the failure test

- Data is first dumped into Cassandra for 11 days and the dumping is then
stopped so that DTCS has a chance to finish all compactions. Data is dumped
with "fake timestamps", i.e. the column timestamp is set explicitly when the
data is written to Cassandra.
- One of the nodes is taken down and new data covering a couple of hours'
worth of data (faked timestamps) is dumped on top of the earlier data.
- Dumping is stopped and the node is kept down for a few hours.
- The node is brought back up and "nodetool repair" is run on the node that
was down.

Consequences

- The repair operation leads to a massive number of new SSTables far back in
the history. The new SSTables cover time spans similar to those of the files
DTCS created before the node was shut down.
- To compact the small files, max_sstable_age_days would have to be increased
so that compaction can reach them. In practice, however, the time window then
grows so large that the generated files become huge, which is not desirable.
The compaction also combines one very large file with a bunch of small files
over several phases, which is inefficient. Generating really large files may
also lead to out-of-disk-space problems.
- See the list of time graphs later in the document.

Improvement proposals for the DTCS configuration

Below is a list of desired properties for the configuration. Current
parameters are mentioned where available, and a hypothetical configuration
sketch follows the list.

- Initial window size (currently: base_time_seconds)
- The number of equally sized windows used for bucketing (currently:
min_threshold)
- The multiplier applied to the window size when it is increased (currently:
min_threshold). We would like this to be independent of min_threshold so that
the rate at which the window size grows can actually be controlled. For
example, with an initial window of one hour and a multiplier of 4, the window
sizes would grow as 1 h, 4 h, 16 h, 64 h, and so on.
- Maximum length of the time window within which files are assigned to a
certain bucket (not currently available). This restricts the growth of the
window length: once the limit is reached, the window size stays the same all
the way back in the history (e.g. one week).
- The maximum horizon within which SSTables are candidates for buckets
(currently: max_sstable_age_days)
- Maximum SSTable file size allowed in a set of files to be compacted (not
currently possible), to prevent out-of-disk-space situations.
- Optional strategies to select the most interesting bucket:
    - Minimum number of SSTables in a time window before the window is a
candidate for the most interesting bucket (currently: min_threshold for the
most recent window, otherwise two). Being able to set this value independently
would allow most of the effort to be put on areas where a large number of
small files should be compacted together, instead of on a few new files.
    - Optionally, the criterion for the most interesting bucket could be
configurable: e.g. select the window with the most files to be compacted.
    - Within a bucket, when the number of files is limited by max_threshold,
compaction would select the small files first, instead of one huge file
together with a bunch of small files.
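
As a concrete sketch, a DTCS configuration using the proposed parameters might
look like the following. Only base_time_seconds, min_threshold, max_threshold
and max_sstable_age_days exist in the current implementation; every other
option name below is hypothetical and only illustrates the proposal:

  ALTER TABLE data_points WITH compaction = {
      'class': 'DateTieredCompactionStrategy',
      'base_time_seconds': '3600',      -- initial window size (existing)
      'min_threshold': '4',             -- windows per bucketing step (existing)
      'window_multiplier': '4',         -- hypothetical: window growth rate
      'max_window_size_days': '7',      -- hypothetical: cap on window length
      'max_sstable_age_days': '30',     -- bucketing horizon (existing)
      'max_compacted_file_size_mb': '51200', -- hypothetical: skip huge files
      'bucket_min_sstables': '8',       -- hypothetical: files needed before a
                                        --   window is a compaction candidate
      'bucket_selection': 'most_files'  -- hypothetical: bucket choice criterion
  };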

The above set of parameters would allow recovery from repair operations that
produce a large number of small SSTables.

- A maximum time window length for compactions would keep the compacted
SSTable size within a reasonable range and would allow the horizon to be
extended far back into the history.
- Combining small files with each other, instead of combining one huge file
with e.g. 31 small files again and again, is more disk efficient.

In addition to the previous advantages, the above parameters would also allow:

- Dumping more data into the history (e.g. new metrics) by assigning the
correct (faked) timestamp for the column, with proper compaction of the new
and existing SSTables.
- Expiring reasonably sized SSTables with TTL even if compactions are executed
intermittently far back in the history. In this case the new data has to be
fed with a dynamically calculated TTL.
- Note: Being able to give an absolute timestamp for the column expiry time
would be beneficial when data is dumped back into the history. This is the
case when you move data from some legacy system to Cassandra with faked
timestamps and would like to keep the data only for a certain time period.
Currently the absolute expiry timestamp is calculated by Cassandra from the
system time and the given TTL, so the TTL has to be calculated dynamically
based on the current time and the desired expiry moment, which makes things
more complex (see the sketch below).
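
To illustrate the dynamic TTL calculation, a sketch against the hypothetical
table from above (the client has to do the arithmetic, since CQL only accepts
a TTL relative to the insertion time):

  -- Desired: the data point expires at an absolute moment expiry_ts.
  -- The client must compute: ttl_seconds = expiry_ts - now.
  INSERT INTO data_points (metric, row_time, offset, value)
  VALUES ('metric.001', '2015-01-01 00:00:00', 3600000, 0x0042)
  USING TIMESTAMP 1420074000000000  -- faked column timestamp (microseconds)
    AND TTL 259200;                 -- e.g. expiry_ts - now = 3 days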

One interesting question is why those duplicate SSTable files are created in
the first place. The duplication problem could not be reproduced when the data
was dumped with the following spec:

- 100 metrics
- 20 days of data in one row
- one year of data
- max_sstable_age_days = 15
- memtable_offheap_space_in_mb was decreased so that small SSTables were
created (to create something to be compacted)
- One node was taken down and one more day of data was dumped on top of the
earlier data
- "nodetool repair -pr" was executed on each node => duplicates were checked
at each step => no duplicates
- "nodetool repair" was executed on the node that was down => no duplicates
were generated


--------------------------------------------------------------------------------------------------------------------

ATTACHMENTS


Time graphs of the content of SSTables from different phases of the test run:

*************************************************************

The fields in the time graphs below are the following:

- Order number from the SSTable file name
- Minimum column timestamp in the SSTable file
- Graphical representation of the time span
- Maximum column timestamp in the SSTable file
- Size of the SSTable in megabytes

Time graphs after dumping the 11 days of data and letting all compactions run
through

node0_20150621_1646_time_graph.txt
node1_20150621_1646_time_graph.txt
node2_20150621_1646_time_graph.txt (error: this is the same file as for
node1, but the behavior is the same)

Time graphs after taking one node down (node2) and dumping a couple of hours
of more data

node0_20150621_2320_time_graph.txt
node1_20150621_2320_time_graph.txt
node2_20150621_2320_time_graph.txt 

Time graphs when the repair operation has finished and the compactions are
done. Compactions naturally handle only the files inside the
max_sstable_age_days range.

==> Now there is a large number of small files covering pretty much the same
areas as the original SSTables

node0_20150623_1526_time_graph.txt
node1_20150623_1526_time_graph.txt
node2_20150623_1526_time_graph.txt

-----------
Trend of the SSTable count as a function of time on each node.
sstable_counts.jpg

Vertical lines:

1) Clearing the database and dumping the 11 days worth of data
2) Stopping the dumping and letting compactions run
3) Taking one node down (the bottom one in the figure) and dumping a few
hours of new data on top of the earlier data
4) Starting the repair operation
5) Repair operation finished


--------
nodetool status output before and after the repair operation
nodetool status infos.txt

--------------
Tracing compactions

Log files were parsed to demonstrate the creation of new small SSTables and
the combination of one large file with a bunch of small ones. This was done
for the time range that max_sstable_age_days is able to reach (in this case 5
days). The hierarchy of the files is shown in the file
"sstable_compaction_trace.txt". The first flushed file can be found on line
10.

Each line represents either a flushed SSTable or an SSTable created by DTCS.
For flushed files the timestamps indicate the time period the file covers. For
compacted files (marked with C) the first timestamp represents the moment when
the compaction was done (wall clock time); the data timestamps themselves are
faked when written to the database. The size of the file is the last field.
The first field, a number in parentheses, shows the level of the file.
Top-level files, marked with (0), are those that don't have any predecessors
and should also still be found on disk.

SSTables that are created by the repair operation are not mentioned in the
log files, so they are handled as phantom files. The existence of such a file
can be concluded from the predecessor list of a compacted SSTable. Those files
are marked with None,None as timestamps.

The file "sstable_compaction_trace_snipped.txt" contains one portion that
shows the compaction hierarchy for the small files originating from the repair
operation. max_threshold is at its default value of 32. In each step, 31 tiny
files are compacted together with a 46 GB file.


sstable_compaction_trace.txt
sstable_compaction_trace_snipped.txt


