[ https://issues.apache.org/jira/browse/CASSANDRA-12268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395181#comment-15395181 ]

Sylvain Lebresne commented on CASSANDRA-12268:
----------------------------------------------

bq. This can also affect normal MV operations, for example when we issue a 
partition deletion on a very large partition.

If I'm not mistaken, the problem is however much easier to fix for rebuild than 
for partition deletion.

The reason it's hard for a partition deletion (or range tombstone) over a large 
partition is that the basic guarantee of MVs is eventual consistency, and that 
guarantee relies on ensuring that if a base update is persisted, then all 
the corresponding MV updates are too, which we guarantee using a logged batch. 
The problem is that a partition deletion is a single base table operation that 
can have *many* associated MV updates, and we _need_ to have those in the same 
logged batch to guarantee eventual consistency.

We don't have this problem for rebuild, since rebuild doesn't insert anything 
into the base table. In fact, I'm not even sure using the batchlog for rebuild 
buys us much. So we can "flush" the MV updates for a key once we've accumulated 
more than some amount of updates, and if we fail in the middle of a key, that's 
fine: we'll just retry that key anyway when the rebuild is restarted.
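To make the idea concrete, here is a minimal sketch of that flushing strategy. All names here are illustrative, not actual Cassandra internals: we buffer the MV updates generated for the current base key and flush whenever the buffer exceeds a threshold, with a final flush when the key is done. A failure mid-key is tolerable precisely because a restarted rebuild reprocesses the key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration of chunked MV-update flushing during rebuild:
// instead of building one unbounded batch per base key, flush every
// maxBuffered updates so heap usage stays bounded.
class ChunkedMutationFlusher {
    private final int maxBuffered;
    private final Consumer<List<String>> flushFn; // stands in for "apply these mutations"
    private final List<String> buffer = new ArrayList<>();

    ChunkedMutationFlusher(int maxBuffered, Consumer<List<String>> flushFn) {
        this.maxBuffered = maxBuffered;
        this.flushFn = flushFn;
    }

    void add(String mvUpdate) {
        buffer.add(mvUpdate);
        if (buffer.size() >= maxBuffered)
            flush();
    }

    // Called when we finish a base key; a failure before this point is fine
    // because a restarted rebuild simply reprocesses that key.
    void finishKey() {
        if (!buffer.isEmpty())
            flush();
    }

    private void flush() {
        flushFn.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```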

So I suggest we only fix the rebuild problem here and postpone the partition 
deletion problem to another ticket. I'll note that "I think" that latter 
problem can be fixed (without losing our basic guarantee) if we split a base 
table partition deletion into an equivalent set of smaller range tombstones 
(of appropriate size). We could then deal with each sub-range separately 
without losing the base-table-vs-MV eventual consistency guarantee. That's a 
pretty involved thing to do, however.
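The splitting idea can be sketched abstractly. This is purely illustrative and not Cassandra code: it splits one wide deletion over an integer clustering range [start, end) into smaller contiguous sub-ranges. Each sub-range could then be paired with its MV updates in its own bounded logged batch, preserving the consistency guarantee one sub-range at a time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a wide range deletion into `parts` smaller,
// contiguous range tombstones covering the same half-open interval.
class RangeTombstoneSplitter {
    static List<long[]> split(long start, long end, int parts) {
        List<long[]> ranges = new ArrayList<>();
        long span = end - start;
        for (int i = 0; i < parts; i++) {
            long lo = start + span * i / parts;
            long hi = start + span * (i + 1) / parts;
            if (lo < hi)
                ranges.add(new long[] { lo, hi }); // one smaller tombstone
        }
        return ranges;
    }
}
```

The hard part in practice, which this sketch glosses over, is choosing split points of "appropriate size" for real clustering keys rather than a known numeric domain.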


> Make MV Index creation robust for wide referent rows
> ----------------------------------------------------
>
>                 Key: CASSANDRA-12268
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12268
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Shook
>            Assignee: Carl Yeksigian
>
> When creating an index for a materialized view over extant data, heap pressure 
> depends heavily on the cardinality of rows associated with each index 
> value. With the way that per-index-value rows are created within the index, 
> this can cause unbounded heap pressure, which can lead to OOM. This appears to 
> be a side effect of each index row being applied atomically, as with batches.
> The commit logs can accumulate enough during the process to prevent the node 
> from being restarted. Given that this occurs during global index creation, 
> it can happen on multiple nodes, making stable recovery of a node set 
> difficult, as co-replicas become unavailable to assist in back-filling data 
> from commitlogs.
> While it is understandable that you want to avoid having relatively wide rows 
> even in materialized views, this represents a particularly difficult 
> scenario for triage.
> The basic recommendation for improving this is to sub-group the index 
> creation into smaller chunks internally, providing a maximal bound on 
> heap pressure when it is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
