[ https://issues.apache.org/jira/browse/CASSANDRA-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593435#comment-13593435 ]
Sylvain Lebresne commented on CASSANDRA-3929:
---------------------------------------------

I have to say that I'm a bit uncomfortable with that patch/ticket. My problem is that it is not very easy to understand what this feature actually does for an end user, and if said user does deletes, the behavior becomes pretty much random.

Let's ignore deletions first and put ourselves in the shoes of a user. That option is supposed to impose a row size limit. So say N = 2 and I insert (not at the same time, nor necessarily in that order) columns A, B and C. Since I cap the row at 2, if I do a full row read, that's what I get: [A, B]. So the row contains only A and B, right? But what if I do a slice(B, "")? Then it depends: I may get [B], but I may also get [B, C] (because maybe a flush happens so that [A, B] ends up in one sstable and [C] in another; C is then still there internally, and the slice has no way to know that it shouldn't return C because C is over the row size limit). And that heavily depends on internal timing: maybe I'll get [B, C], but if I try one second later I'll get [B] because compaction has kicked in. So, what gives?

Adding deletion makes that even worse. If you start doing deletes, then depending on the timing of flush/compaction, you may not even get the N first columns you've inserted in the row (typically, in Fabien's example above, if you change when the flush occurs, even with the last patch attached, you may get either [A, C] (which is somewhat wrong, really) or [A, C, D]).

I also want to mention that because compaction/flush don't happen synchronously on all replicas, there is a high chance that even if the replicas are consistent, their actual sstable contents differ, which probably breaks repair fairly badly.

Let's be clear: I'm not saying that this feature cannot be useful. But it is a bit of a hack whose semantics depend on the internal timing of operations, not a feature with a cleanly defined semantic.
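The slice problem above can be sketched with a toy model. This is not Cassandra code; the names (`N`, `full_row_read`, `slice_read`, `compact`) and the whole cap-only-at-full-read-and-compaction behavior are invented here purely to illustrate why the same slice query can return different results before and after compaction:

```python
# Toy model (NOT Cassandra internals) of a per-row column cap N that is
# enforced on full-row reads and at compaction time, but that a slice
# read cannot enforce because it doesn't see the whole row.

N = 2  # hypothetical row size cap

def full_row_read(sstables):
    """Merge all sstables for the row, then apply the cap of N columns."""
    merged = sorted(set().union(*sstables))
    return merged[:N]

def slice_read(sstables, start):
    """Slice from `start` onward. The merge sees whatever columns are
    physically present; it has no way to know C is over the row cap."""
    merged = sorted(set().union(*sstables))
    return [c for c in merged if c >= start]

def compact(sstables):
    """Compaction merges the sstables and drops columns beyond the cap."""
    return [full_row_read(sstables)]

# A flush left [A, B] in one sstable and [C] in another:
before = [["A", "B"], ["C"]]
print(full_row_read(before))    # ['A', 'B'] -- the cap is honored
print(slice_read(before, "B"))  # ['B', 'C'] -- C leaks past the cap

# One second later, compaction has run:
after = compact(before)
print(slice_read(after, "B"))   # ['B'] -- same query, different answer
```

Same data, same query, two different answers depending only on whether compaction happened to run in between, which is the timing-dependent semantic the comment objects to.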
That's why I said earlier that I always thought this would make a good externally contributed compaction strategy, but a priori it feels a bit too hacky for core Cassandra imo. I haven't made up my mind completely yet, but I wanted to voice my concern first and see what others think. And I have to say that if we do go ahead with this feature in core Cassandra, I'd be in favor of disabling deletes on CFs that have that option set, because imo throwing deletes into the mix makes things too unpredictable to be really useful.

> Support row size limits
> -----------------------
>
>                 Key: CASSANDRA-3929
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3929
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Dave Brosius
>            Priority: Minor
>              Labels: ponies
>             Fix For: 2.0
>
>         Attachments: 3929_b.txt, 3929_c.txt, 3929_d.txt, 3929_e.txt, 3929_f.txt, 3929_g_tests.txt, 3929_g.txt, 3929.txt
>
> We currently support expiring columns by time-to-live; we've also had requests for keeping the most recent N columns in a row.