[ https://issues.apache.org/jira/browse/CASSANDRA-3929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593435#comment-13593435 ]

Sylvain Lebresne commented on CASSANDRA-3929:
---------------------------------------------

I have to say that I'm a bit uncomfortable with that patch/ticket.

My problem is that it's not very easy to understand what this feature actually 
does for an end user, and once said user starts doing deletes, the behavior 
becomes pretty much random.

Let's ignore deletions at first and put ourselves in the shoes of a user.

That option is supposed to impose a row size limit. So say N = 2 and I insert 
(not at the same time, nor necessarily in that order) columns A, B and C. Since 
I cap the row at 2, a full row read gives me [A, B]. So the row contains only 
A and B, right? But what if I do a slice(B, "")? Then it depends: I may get 
[B], but I may also get [B, C] (because maybe a flush happens so that [A, B] 
ends up in one sstable and [C] in another; C is then still there internally, 
and the slice has no way to know that it shouldn't return C because C is over 
the row size limit). And that depends heavily on internal timing: maybe I'll 
get [B, C], but if I try again one second later I'll get [B] because compaction 
has kicked in. So, what gives?
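
To make that concrete, here is a toy sketch in plain Java (not Cassandra 
internals; the merge-then-trim behavior is my assumption of how the cap would 
work on the read path). Two sstables hold [A, B] and [C] after a flush; the 
full row read can trim to N, but the slice starting at B cannot:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy model: a row capped at N = 2 columns, spread across two sstables
// because a flush happened between the writes.
public class RowCapDemo {
    static final int N = 2;

    public static void main(String[] args) {
        List<TreeSet<String>> sstables = Arrays.asList(
                new TreeSet<>(Arrays.asList("A", "B")), // first sstable
                new TreeSet<>(Arrays.asList("C")));     // flushed later

        // Full row read: merge everything, then keep the first N columns.
        List<String> merged = new ArrayList<>();
        for (TreeSet<String> sstable : sstables)
            merged.addAll(sstable);
        merged.sort(null); // natural ordering
        System.out.println(merged.subList(0, Math.min(N, merged.size()))); // [A, B]

        // slice(B, ""): each sstable only contributes columns >= B; nothing
        // tells the read path that C is past the row size limit, so C leaks.
        List<String> slice = new ArrayList<>();
        for (TreeSet<String> sstable : sstables)
            slice.addAll(sstable.tailSet("B"));
        System.out.println(slice); // [B, C]; once compaction drops C, just [B]
    }
}
{code}

The same slice returns [B, C] or [B] depending only on whether compaction has 
already merged the two sstables, which is exactly the timing dependence above.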

Adding deletes makes that even worse. If you start doing deletes then, 
depending on the timing of flush/compaction, you may not even get the first N 
columns you've inserted in the row (typically, in Fabien's example above, if 
you change when the flush occurs, then even with the last patch attached you 
may get either [A, C] (which is somewhat wrong, really) or [A, C, D]).
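
Again just a toy illustration (N = 3 and column names A..D are my 
placeholders, not Fabien's exact example): whether the trim-to-N sees the 
tombstone for B depends on whether the delete landed before or after the 
flush that did the trimming:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of why deletes make the cap timing-dependent.
public class CapWithDeletes {
    static final int N = 3;

    static List<String> trim(List<String> row) {
        return new ArrayList<>(row.subList(0, Math.min(N, row.size())));
    }

    public static void main(String[] args) {
        List<String> inserted = Arrays.asList("A", "B", "C", "D");

        // Case 1: the flush trims to N before the tombstone for B is merged.
        List<String> early = trim(new ArrayList<>(inserted)); // [A, B, C]
        early.remove("B");
        System.out.println(early);      // [A, C] -- D was already dropped

        // Case 2: the delete lands in the same sstable, so the trim sees it.
        List<String> late = new ArrayList<>(inserted);
        late.remove("B");
        System.out.println(trim(late)); // [A, C, D]
    }
}
{code}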

I also want to mention that because compaction/flush don't happen synchronously 
on all replicas, there is a high chance that even if the replicas are 
consistent, their actual sstable contents differ, which probably breaks repair 
fairly badly.

Let's be clear: I'm not saying this feature cannot be useful. But I am saying 
that it is a bit of a hack whose semantics depend on the internal timing of 
operations, not a feature with cleanly defined semantics. That's why I said 
earlier that I always thought this would make a good externally contributed 
compaction strategy, but a priori it feels a bit too hacky for core Cassandra 
imo. I haven't made up my mind completely yet, but I wanted to voice my concern 
first and see what others think. And I have to say that if we do go ahead with 
this feature in core Cassandra, I'd be in favor of disabling deletes on column 
families that have this option set, because imo throwing deletes into the mix 
makes things too unpredictable to be really useful.

                
> Support row size limits
> -----------------------
>
>                 Key: CASSANDRA-3929
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3929
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Dave Brosius
>            Priority: Minor
>              Labels: ponies
>             Fix For: 2.0
>
>         Attachments: 3929_b.txt, 3929_c.txt, 3929_d.txt, 3929_e.txt, 
> 3929_f.txt, 3929_g_tests.txt, 3929_g.txt, 3929.txt
>
>
> We currently support expiring columns by time-to-live; we've also had 
> requests for keeping the most recent N columns in a row.
