[ https://issues.apache.org/jira/browse/CASSANDRA-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615060#comment-14615060 ]

Jeff Jirsa commented on CASSANDRA-9666:
---------------------------------------

{quote}
We should not provide 2 different compaction strategies for time series data, 
instead we should try to fix the issues you list in DTCS
{quote}

I'm willing to consider that premise; however, doing so will require the 
addition of a number of tunable variables, turning DTCS into a complex, 
unwieldy beast. We both agree that "DTCS has proven quite difficult to use", 
and adding complexity to an already complex situation seems like a dangerous 
path. DTCS is already very, very difficult to tune, and the difficulty is tied 
to its windowing and tiering algorithms - adding MORE knobs to those same 
algorithms seems unlikely to make it easier to operate in production.

{quote}
We should do STCS within all windows, CASSANDRA-9644
{quote}

Potentially, when the number of sstables in a window exceeds max_threshold. 
Right now TWCS sorts each older window's sstables by file size, which creates 
a pseudo-STCS behavior that gets the total number of sstables in later windows 
down to 1 as quickly as possible. A pass with true STCS isn't guaranteed to 
knock us below max_threshold (unless, perhaps, we call it with some non-default 
thresholds to make it more aggressive). We can look at whether or not STCS is 
actually productive, but right now I'm faking STCS in TWCS's past windows.
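
To make that selection concrete, here's a minimal sketch - not the actual 
patch, and SSTable/onDiskLength are stand-in names for illustration - of 
compacting an old window's smallest files first, up to max_threshold at a time:

{code:java}
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class PseudoStcsSelection
{
    interface SSTable { long onDiskLength(); } // stand-in for the real reader type

    // Within a non-current window, compact the smallest files first, up to
    // maxThreshold at a time, so each old window trends toward one sstable.
    static List<SSTable> selectFromOldWindow(List<SSTable> window, int maxThreshold)
    {
        if (window.size() <= 1)
            return Collections.emptyList(); // nothing left to compact here
        return window.stream()
                     .sorted(Comparator.comparingLong(SSTable::onDiskLength))
                     .limit(maxThreshold)
                     .collect(Collectors.toList());
    }
}
{code}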

{quote}
This will really hurt read performance if we do this, you will end up hitting 
all sstables on disk for every read. The real fix would be to either flush to 
several sstables (split by timestamps), or, doing a first special compaction 
that writes several sstables into the correct windows.
{quote}

I wasn't about to submit a change to the flushing code; I'm not nearly familiar 
enough with it to consider doing that. I wouldn't mind seeing it, but it's 
probably above my head.

However, I expected to write a splitting major compaction for TWCS that splits 
sstables into the correct windows. I will write it in advance if you think it 
addresses a critical problem that fundamentally breaks TWCS.

I also believe the same problem exists in DTCS: any foreground read repair has 
exactly the same characteristic of throwing off sstable timestamps and 
potentially causing reads to hit multiple sstables. In DTCS, it's magnified by 
the fact that it ALSO throws off sstable compaction after the fact.

{quote}
unsure what you mean here, could you elaborate?
{quote}

Basically, unlike DTCS, if an old sstable DOES end up beyond the window where 
we expect it (max_sstable_age_days), we still compact it. This isn't limited to 
repair - it can also happen due to bootstrap and decommission, important 
operations that real clusters perform regularly, even clusters holding time 
series data. The resulting sstables will be compacted regardless of whether or 
not STCS would choose them as candidates, by sorting all files in that window 
from smallest to largest (again, windows are based on max timestamp) and 
choosing up to max_threshold. DTCS sorts by timestamp instead, which makes it 
very likely to compact the biggest files together first, leaving small files to 
live far longer than they should and - in many cases - causing the repeated 
recompaction of the original large sstables.
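
As an illustration of that bucketing, here's a hedged sketch; the names 
(windowSizeMillis, maxTimestamp) are assumptions for the example, not the 
patch's actual identifiers:

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WindowBuckets
{
    interface SSTable { long maxTimestamp(); } // stand-in for the real reader type

    // Group sstables by the epoch-aligned window containing their *max*
    // timestamp; an old sstable created by repair/bootstrap/decommission lands
    // in an old window and remains an ordinary compaction candidate there.
    static Map<Long, List<SSTable>> bucketByMaxTimestamp(Iterable<SSTable> sstables,
                                                         long windowSizeMillis)
    {
        Map<Long, List<SSTable>> buckets = new HashMap<>();
        for (SSTable s : sstables)
        {
            long windowStart = (s.maxTimestamp() / windowSizeMillis) * windowSizeMillis;
            buckets.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(s);
        }
        return buckets;
    }
}
{code}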

{quote}
Could you elaborate here? How do you avoid old data getting into the windows?
{quote}

You don't - we accept that data will land in old windows, and we make sure we 
compact it rather than punting with max_sstable_age_days. We always use the max 
timestamp to place files into a window, and we use consistent epoch-based 
windows, to avoid surprises caused by old data arriving in those windows. 
Conversely, DTCS pretends you can prevent this from happening, when it's not 
actually possible in a real-world cluster, and then gets wildly confused.
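
To show what consistent epoch-based windows buy you, here's a small sketch: 
every timestamp maps deterministically to the same fixed, epoch-aligned window 
no matter when compaction runs, so late-arriving data always joins the same 
bucket. The unit/size handling loosely mirrors TWCS's window options but is 
simplified for illustration:

{code:java}
import java.util.concurrent.TimeUnit;

class EpochWindows
{
    // Inclusive lower bound (epoch millis) of the window containing tsMillis.
    static long windowLowerBound(TimeUnit unit, int size, long tsMillis)
    {
        long windowMillis = unit.toMillis(size);         // e.g. HOURS, 6 -> 21,600,000
        return (tsMillis / windowMillis) * windowMillis; // floor to window start
    }

    public static void main(String[] args)
    {
        // Two writes five hours apart still land in the same 6-hour window:
        long t0 = 1435708800000L; // an epoch-aligned instant
        long t1 = t0 + TimeUnit.HOURS.toMillis(5);
        System.out.println(windowLowerBound(TimeUnit.HOURS, 6, t0) ==
                           windowLowerBound(TimeUnit.HOURS, 6, t1)); // true
    }
}
{code}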

{quote}
Unsure what the issue with DTCS is here (re: streaming operations)?
{quote}

1) Streaming operations beyond max_sstable_age_days are awful in DTCS for 
reasons that should be fairly obvious. Not just repair - all streaming 
(bootstrap/decom/bulk load). 
2) Raising max_sstable_age_days may not actually be viable for most people, 
because the tiering then joins adjacent windows beyond what they expect/desire 
(not only does it cause a ton of IO for new compactions; in many cases it will 
run servers out of disk trying to reach the expected state).

{quote}
No, but as mentioned above, this will confuse users in other ways, by making 
reads very slow
{quote}

Again, I believe that in most real clusters this is also true of DTCS. 


> Provide an alternative to DTCS
> ------------------------------
>
>                 Key: CASSANDRA-9666
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9666
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jeff Jirsa
>            Assignee: Jeff Jirsa
>             Fix For: 2.1.x, 2.2.x
>
>
> DTCS is great for time series data, but it comes with caveats that make it 
> difficult to use in production (typical operator behaviors such as bootstrap, 
> removenode, and repair have MAJOR caveats as they relate to 
> max_sstable_age_days, and hints/read repair break the selection algorithm).
> I'm proposing an alternative, TimeWindowCompactionStrategy, that sacrifices 
> the tiered nature of DTCS in order to address some of DTCS' operational 
> shortcomings. I believe it is necessary to propose an alternative rather than 
> simply adjusting DTCS, because it fundamentally removes the tiered nature in 
> order to remove the parameter max_sstable_age_days - the result is very, very 
> different, even if it is heavily inspired by DTCS. 
> Specifically, rather than creating a number of windows of ever increasing 
> sizes, this strategy allows an operator to choose the window size, compact 
> with STCS within the first window of that size, and aggressively compact down 
> to a single sstable once that window is no longer current. The window size is 
> a combination of unit (minutes, hours, days) and size (1, etc), such that an 
> operator can expect all data using a block of that size to be compacted 
> together (that is, if your unit is hours, and size is 6, you will create 
> roughly 4 sstables per day, each one containing roughly 6 hours of data). 
> The result addresses a number of the problems with 
> DateTieredCompactionStrategy:
> - At the present time, DTCS’s first window is compacted using an unusual 
> selection criterion, which prefers files with earlier timestamps but ignores 
> sizes. In TimeWindowCompactionStrategy, the first window’s data will be 
> compacted with the well tested, fast, reliable STCS. All STCS options can be 
> passed to TimeWindowCompactionStrategy to configure the first window’s 
> compaction behavior.
> - HintedHandoff may put old data in new sstables, but it will have little 
> impact other than slightly reduced efficiency (sstables will cover a wider 
> range, but the old timestamps will not impact sstable selection criteria 
> during compaction)
> - ReadRepair may put old data in new sstables, but it will have little impact 
> other than slightly reduced efficiency (sstables will cover a wider range, 
> but the old timestamps will not impact sstable selection criteria during 
> compaction)
> - Small, old sstables resulting from streams of any kind will be swiftly and 
> aggressively compacted with the other sstables matching their similar 
> maxTimestamp, without causing sstables in neighboring windows to grow in size.
> - The configuration options are explicit and straightforward - the tuning 
> parameters leave little room for error. The window is set in common, easily 
> understandable terms such as “12 hours”, “1 Day”, “30 days”. The 
> minute/hour/day options are granular enough for users keeping data for hours, 
> and users keeping data for years. 
> - There is no explicitly configurable max sstable age, though sstables will 
> naturally stop compacting once new data is written in that window. 
> - Streaming operations can create sstables with old timestamps, and they'll 
> naturally be joined together with sstables in the same time bucket. This is 
> true for bootstrap/repair/sstableloader/removenode. 
> - It remains true that if old data and new data are written into the memtable 
> at the same time, the resulting sstables will be treated as if they were new 
> sstables; however, that no longer negatively impacts the compaction 
> strategy’s selection criteria for older windows. 
> Patch provided for both 2.1 ( 
> https://github.com/jeffjirsa/cassandra/commits/twcs-2.1 ) and 2.2 ( 
> https://github.com/jeffjirsa/cassandra/commits/twcs )


