Antti Nissinen created CASSANDRA-9644:
-----------------------------------------

             Summary: DTCS configuration proposals for handling consequences of 
repairs
                 Key: CASSANDRA-9644
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9644
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
            Reporter: Antti Nissinen
             Fix For: 3.x, 2.1.x
         Attachments: node0_20150621_1646_time_graph.txt, 
node0_20150621_2320_time_graph.txt, node0_20150623_1526_time_graph.txt, 
node1_20150621_1646_time_graph.txt, node1_20150621_2320_time_graph.txt, 
node1_20150623_1526_time_graph.txt, node2_20150621_1646_time_graph.txt, 
node2_20150621_2320_time_graph.txt, node2_20150623_1526_time_graph.txt, 
nodetool status infos.txt, sstable_compaction_trace.txt, 
sstable_compaction_trace_snipped.txt, sstable_counts.jpg

This document brings up some issues encountered when DTCS is used to compact
time series data in a three node cluster. DTCS is currently configured with a
few parameters that keep the configuration fairly simple, but this can cause
problems in certain special cases, such as recovering from the flood of small
SSTables created by a repair operation. We are suggesting some ideas that might
be a starting point for further discussion. The following sections contain:

- Description of the cassandra setup
- Feeding process of the data
- Failure testing
- Issues that repair operations cause for DTCS
- Proposal for the DTCS configuration parameters

Attachments are included to support the discussion, and a separate section at
the end of this document explains them.

Cassandra setup and data model

- The cluster is composed of three nodes running Cassandra 2.1.2. The
replication factor is two and both read and write consistency levels are ONE.
- Data is time series data. It is saved so that one row contains a certain
time span of data for a given metric (20 days in this case). The row key
contains the start time of the time span and the metric name. The column name
gives the offset from the beginning of the time span. The column timestamp is
set to the actual timestamp of the data point, i.e. the timestamp from the row
key plus the offset. The data model is analogous to the KairosDB
implementation (see the CQL sketch after this list).
- The average sampling rate is 10 seconds, varying significantly from metric
to metric.
- 100,000 metrics are fed into Cassandra.
- max_sstable_age_days is set to 5 days (the objective is to keep SSTable
files at a manageable size, around 50 GB).
- TTL is not used in the test.
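
As a point of reference, a minimal CQL sketch of this kind of layout, assuming
a KairosDB-like schema (the table and column names here are hypothetical, not
the actual schema used in the test):

  CREATE TABLE data_points (
      metric text,         -- metric name (part of the row key)
      row_time timestamp,  -- start of the 20-day time span (part of the row key)
      offset int,          -- offset from the beginning of the time span
      value blob,
      PRIMARY KEY ((metric, row_time), offset)
  ) WITH compaction = {
      'class': 'DateTieredCompactionStrategy',
      'max_sstable_age_days': '5'
  };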

Procedure for the failure test

- Data is first dumped into Cassandra for 11 days and the dumping is then
stopped so that DTCS has a chance to finish all compactions. Data is dumped
with "fake timestamps", i.e. the column timestamp is set explicitly when the
data is written to Cassandra.
- One of the nodes is taken down and new data covering a couple of hours'
worth of data (faked timestamps) is dumped on top of the earlier data.
- Dumping is stopped and the node is kept down for a few hours.
- The node is brought back up and "nodetool repair" is run on the node that
was down.

Consequences

- The repair operation leads to a massive number of new SSTables far back in
the history. The new SSTables cover time spans similar to those of the files
DTCS created before the node was shut down.
- To compact the small files, max_sstable_age_days would have to be increased
so that compaction can reach them. In practice, however, the time window then
grows so large that the generated files become huge, which is not desirable.
The compaction also combines one very large file with a bunch of small files
over several phases, which is inefficient. Generating really large files may
also lead to out-of-disk-space problems.
- See the list of time graphs later in the document.

Improvement proposals for the DTCS configuration

Below is a list of desired properties for the configuration. Current
parameters are mentioned where available, and a hypothetical configuration
sketch follows the list.

- Initial window size (currently: base_time_seconds)
- The number of equally sized windows used for bucketing (currently:
min_threshold)
- The multiplier applied to the window size when it is increased (currently:
min_threshold). We would like this to be independent of min_threshold so that
the rate at which the window size grows can actually be controlled. For
example, with an initial window of one hour and a multiplier of 4, the window
sizes would grow as 1 h, 4 h, 16 h, 64 h, and so on.
- Maximum length of the time window within which files are assigned to a
certain bucket (not currently available). This restricts the growth of the
window length: once the limit is reached, the window size stays the same all
the way back in the history (e.g. one week).
- The maximum horizon within which SSTables are candidates for buckets
(currently: max_sstable_age_days)
- Maximum SSTable file size allowed in a set of files to be compacted (not
currently possible), to prevent out-of-disk-space situations.
- Optional strategies to select the most interesting bucket:
    - Minimum number of SSTables in a time window before the window is a
candidate for the most interesting bucket (currently: min_threshold for the
most recent window, otherwise two). Being able to set this value independently
would allow most of the effort to be put on areas where a large number of
small files should be compacted together, instead of on a few new files.
    - Optionally, the criterion for the most interesting bucket could be
configurable: e.g. select the window with the most files to be compacted.
    - Within a bucket, when the number of files is limited by max_threshold,
compaction would select the small files first, instead of one huge file
together with a bunch of small files.
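
As a concrete sketch, a DTCS configuration using the proposed parameters might
look like the following. Only base_time_seconds, min_threshold, max_threshold
and max_sstable_age_days exist in the current implementation; every other
option name below is hypothetical and only illustrates the proposal:

  ALTER TABLE data_points WITH compaction = {
      'class': 'DateTieredCompactionStrategy',
      'base_time_seconds': '3600',      -- initial window size (existing)
      'min_threshold': '4',             -- windows per bucketing step (existing)
      'window_multiplier': '4',         -- hypothetical: window growth rate
      'max_window_size_days': '7',      -- hypothetical: cap on window length
      'max_sstable_age_days': '30',     -- bucketing horizon (existing)
      'max_compacted_file_size_mb': '51200', -- hypothetical: skip huge files
      'bucket_min_sstables': '8',       -- hypothetical: files needed before a
                                        --   window is a compaction candidate
      'bucket_selection': 'most_files'  -- hypothetical: bucket choice criterion
  };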

The above set of parameters would allow recovery from repair operations that
produce a large number of small SSTables.

- A maximum time window length for compactions would keep the compacted
SSTable size within a reasonable range and would allow the horizon to be
extended far back into the history.
- Combining small files with each other, instead of combining one huge file
with e.g. 31 small files again and again, is more disk efficient.

In addition to the previous advantages, the above parameters would also allow:

- Dumping more data into the history (e.g. new metrics) by assigning the
correct (faked) timestamp for the column, with proper compaction of the new
and existing SSTables.
- Expiring reasonably sized SSTables with TTL even if compactions are executed
intermittently far back in the history. In this case the new data has to be
fed with a dynamically calculated TTL.
- Note: Being able to give an absolute timestamp for the column expiry time
would be beneficial when data is dumped back into the history. This is the
case when you move data from some legacy system to Cassandra with faked
timestamps and would like to keep the data only for a certain time period.
Currently the absolute expiry timestamp is calculated by Cassandra from the
system time and the given TTL, so the TTL has to be calculated dynamically
based on the current time and the desired expiry moment, which makes things
more complex (see the sketch below).
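
To illustrate the dynamic TTL calculation, a sketch against the hypothetical
table from above (the client has to do the arithmetic, since CQL only accepts
a TTL relative to the insertion time):

  -- Desired: the data point expires at an absolute moment expiry_ts.
  -- The client must compute: ttl_seconds = expiry_ts - now.
  INSERT INTO data_points (metric, row_time, offset, value)
  VALUES ('metric.001', '2015-01-01 00:00:00', 3600000, 0x0042)
  USING TIMESTAMP 1420074000000000  -- faked column timestamp (microseconds)
    AND TTL 259200;                 -- e.g. expiry_ts - now = 3 days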

One interesting question is why those duplicate SSTable files are created in
the first place. The duplication problem could not be reproduced when the data
was dumped with the following spec:

- 100 metrics
- 20 days of data in one row
- one year of data
- max_sstable_age_days = 15
- memtable_offheap_space_in_mb was decreased so that small SSTables were
created (to create something to be compacted)
- One node was taken down and one more day of data was dumped on top of the
earlier data
- "nodetool repair -pr" was executed on each node => duplicates were checked
at each step => no duplicates
- "nodetool repair" was executed on the node that was down => no duplicates
were generated


--------------------------------------------------------------------------------------------------------------------

ATTACHMENTS


Time graphs of the content of SSTables from different phases of the test run:

*************************************************************

The fields in the time graphs below are the following:

- Order number from the SSTable file name
- Minimum column timestamp in the SSTable file
- Graphical representation of the time span
- Maximum column timestamp in the SSTable file
- Size of the SSTable in megabytes

Time graphs after dumping the 11 days of data and letting all compactions run
through

node0_20150621_1646_time_graph.txt
node1_20150621_1646_time_graph.txt
node2_20150621_1646_time_graph.txt (error: this is the same file as for
node1, but the behavior is the same)

Time graphs after taking one node down (node2) and dumping a couple of hours
of more data

node0_20150621_2320_time_graph.txt
node1_20150621_2320_time_graph.txt
node2_20150621_2320_time_graph.txt 

Time graphs when the repair operation has finished and the compactions are
done. Compactions naturally handle only the files inside the
max_sstable_age_days range.

==> Now there is a large number of small files covering pretty much the same
areas as the original SSTables

node0_20150623_1526_time_graph.txt
node1_20150623_1526_time_graph.txt
node2_20150623_1526_time_graph.txt

-----------
Trend of the SSTable count as a function of time on each node.
sstable_counts.jpg

Vertical lines:

1) Clearing the database and dumping the 11 days worth of data
2) Stopping the dumping and letting compactions run
3) Taking one node down (the bottom one in the figure) and dumping a few
hours of new data on top of the earlier data
4) Starting the repair operation
5) Repair operation finished


--------
nodetool status output before and after the repair operation
nodetool status infos.txt

--------------
Tracing compactions

Log files were parsed to demonstrate the creation of new small SSTables and
the combination of one large file with a bunch of small ones. This was done
for the time range that max_sstable_age_days is able to reach (in this case 5
days). The hierarchy of the files is shown in the file
"sstable_compaction_trace.txt". The first flushed file can be found on line
10.

Each line represents either a flushed SSTable or an SSTable created by DTCS.
For flushed files the timestamps indicate the time period the file covers. For
compacted files (marked with C) the first timestamp represents the moment when
the compaction was done (wall clock time); the data timestamps themselves are
faked when written to the database. The size of the file is the last field.
The first field, a number in parentheses, shows the level of the file.
Top-level files, marked with (0), are those that don't have any predecessors
and should also still be found on disk.

SSTables that are created by the repair operation are not mentioned in the
log files, so they are handled as phantom files. The existence of such a file
can be concluded from the predecessor list of a compacted SSTable. Those files
are marked with None,None as timestamps.

The file "sstable_compaction_trace_snipped.txt" contains one portion that
shows the compaction hierarchy for the small files originating from the repair
operation. max_threshold is at its default value of 32. In each step, 31 tiny
files are compacted together with a 46 GB file.


sstable_compaction_trace.txt
sstable_compaction_trace_snipped.txt


