[ https://issues.apache.org/jira/browse/CASSANDRA-13299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16180923#comment-16180923 ]

Paulo Motta commented on CASSANDRA-13299:
-----------------------------------------

Thanks for the updates, this is looking good! I managed to reproduce an OOM 
when repairing a wide partition with 100K rows and verified that this patch 
avoids the OOM by splitting the partition into multiple batches (and found 
CASSANDRA-13899 along the way). Awesome job!

While splitting in the happy case works nicely, we need to ensure that the 
range tombstone handling (especially range deletions) is correct and well 
tested before committing this.

I noticed that the previous {{ThrottledUnfilteredIterator}} implementation 
could [potentially 
return|https://github.com/jasonstack/cassandra/blob/b8cb49035d3cf77198b31df7c51a174fffe3edaf/src/java/org/apache/cassandra/db/rows/ThrottledUnfilteredIterator.java#L166]
 {{throttle+2}} unfiltereds per batch, contradicting the documentation, which 
states that the maximum number of unfiltereds per batch is {{throttle+1}}. I 
also noticed that the case where there is a row between two markers was not 
covered by the existing tests, since a range deletion is needed to reproduce 
that scenario. I fixed this and added more tests in [this 
commit|https://github.com/pauloricardomg/cassandra/commit/47d8ca3592cb6382bb4c308720646395306a0a69].
 Could you also modify 
[complexThrottleWithTombstoneTest|https://github.com/jasonstack/cassandra/commit/b8cb49035d3cf77198b31df7c51a174fffe3edaf#diff-5162644c24391628b339b88c3619427cR66]
 to test range deletions?
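
To make the expected invariant concrete, here is a minimal, self-contained 
sketch of the batching logic as I understand it (illustrative names only, not 
Cassandra's actual classes): a batch that fills up inside a range tombstone 
appends a synthetic close marker, so it may hold up to {{throttle+1}} 
unfiltereds, and the next batch re-opens the tombstone. A row between two 
markers exercises exactly this re-open path:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ThrottleSketch
{
    enum Kind { ROW, OPEN_MARKER, CLOSE_MARKER }

    // Split a stream of unfiltereds into batches of at most throttle+1 items;
    // the +1 happens only when a synthetic close marker must end the batch.
    static List<List<Kind>> throttle(Iterator<Kind> source, int throttle)
    {
        if (throttle <= 1)
            throw new IllegalArgumentException("throttle must be > 1 to fit re-opened markers");
        List<List<Kind>> batches = new ArrayList<>();
        boolean openTombstone = false;
        while (source.hasNext())
        {
            List<Kind> batch = new ArrayList<>();
            if (openTombstone)
                batch.add(Kind.OPEN_MARKER); // re-open the tombstone split by the previous batch
            while (batch.size() < throttle && source.hasNext())
            {
                Kind next = source.next();
                batch.add(next);
                if (next == Kind.OPEN_MARKER)
                    openTombstone = true;
                else if (next == Kind.CLOSE_MARKER)
                    openTombstone = false;
            }
            if (openTombstone && source.hasNext())
                batch.add(Kind.CLOSE_MARKER); // synthetic close: batch may reach throttle+1
            batches.add(batch);
        }
        return batches;
    }

    public static void main(String[] args)
    {
        // A row between two markers, i.e. a range deletion covering a live row.
        List<Kind> partition = Arrays.asList(Kind.ROW, Kind.OPEN_MARKER, Kind.ROW,
                                             Kind.CLOSE_MARKER, Kind.ROW);
        for (List<Kind> batch : throttle(partition.iterator(), 2))
        {
            if (batch.size() > 3) // throttle+1
                throw new AssertionError("invariant violated: " + batch);
            System.out.println(batch);
        }
    }
}
{code}

With {{throttle=2}} this prints batches of at most three unfiltereds, the 
middle one being {{[OPEN_MARKER, ROW, CLOSE_MARKER]}}, the row-between-markers 
case described above.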

The previous change 
[requires|https://github.com/pauloricardomg/cassandra/commit/47d8ca3592cb6382bb4c308720646395306a0a69#diff-2acee8fea5cd82a51fda4af6e38faf13R60]
 the minimum throttle size to be 2, since otherwise the iterator could not 
make progress in the presence of open and close markers.
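
To see why, consider a hypothetical trace of the sketch above with 
{{throttle=1}}, once a range tombstone has been opened:

{code:java}
// Hypothetical trace, throttle = 1, after a range tombstone has been opened:
//
//   batch N:   [OPEN_MARKER, CLOSE_MARKER]  // the re-opened marker fills the
//   batch N+1: [OPEN_MARKER, CLOSE_MARKER]  // single slot, the synthetic close
//   ...                                     // is appended, and the source is
//                                           // never consumed again
//
// With throttle >= 2 there is always at least one slot left for an actual
// unfiltered from the source, so the iterator is guaranteed to make progress.
{code}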

I think that instead of throwing an {{AssertionError}} when the returned 
iterator is not exhausted, we could simply exhaust it, effectively skipping 
the remaining entries, since consuming a batch only partially is a plausible 
usage of {{ThrottledUnfilteredIterator}}, so I did this in [this 
commit|https://github.com/pauloricardomg/cassandra/commit/04ed5ecb5183195601950fc9efd2ca9123596487].
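
In other words, something along these lines (an illustrative helper, not the 
committed code):

{code:java}
import java.util.Iterator;

// Illustrative helper (hypothetical name): instead of asserting that the
// previous batch was fully consumed, drain its leftovers so callers may
// legitimately stop reading a batch early.
final class ExhaustSketch
{
    static void exhaust(Iterator<?> it)
    {
        // previously: assert !it.hasNext();
        while (it.hasNext())
            it.next(); // skip entries the caller did not consume
    }
}
{code}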

I also added a utility method 
{{ThrottledUnfilteredIterator.throttle(UnfilteredPartitionIterator 
partitionIterator, int maxBatchSize)}} to allow throttling an 
{{UnfilteredPartitionIterator}} transparently, and used it in 
{{StreamReceiveTask}} in [this 
commit|https://github.com/pauloricardomg/cassandra/commit/4f8c3b8faa2644133d301ac7bf7b748f7ec265ee].
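
The intended usage looks roughly like this (a paraphrased sketch, not the 
committed {{StreamReceiveTask}} code; {{applyMutation}} is a hypothetical 
stand-in, and I am assuming the utility returns a 
{{CloseableIterator<UnfilteredRowIterator>}}):

{code:java}
// Paraphrased sketch: each throttled batch becomes its own bounded-size
// update, so a huge partition no longer materializes as one huge mutation.
try (CloseableIterator<UnfilteredRowIterator> batches =
         ThrottledUnfilteredIterator.throttle(partitionIterator, maxBatchSize))
{
    while (batches.hasNext())
    {
        try (UnfilteredRowIterator batch = batches.next())
        {
            applyMutation(batch); // hypothetical: build and apply one mutation per batch
        }
    }
}
{code}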

I had another look at the {{throttled_partition_update_test}} 
[dtest|https://github.com/riptano/cassandra-dtest/commit/f3307adef349f232ec0ae64e902164684f32cca0]
 and think we can make the following improvements:
* Right now we're [verifying the 
results|https://github.com/riptano/cassandra-dtest/commit/f3307adef349f232ec0ae64e902164684f32cca0#diff-62ba429edee6a4681782f078246c9893R1410]
 with all the nodes UP, but it's possible for another node to answer the query 
even though one of the inconsistent nodes did not stream correctly. I think we 
should check the results on each node individually (with the other nodes down) 
to ensure each of them streamed data correctly from its peers.
* Add [range deletions|https://issues.apache.org/jira/browse/CASSANDRA-6237] 
since that's when the range tombstone special cases will be properly exercised.

Please let me know what you think about these suggestions.

> Potential OOMs and lock contention in write path streams
> --------------------------------------------------------
>
>                 Key: CASSANDRA-13299
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13299
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Materialized Views
>            Reporter: Benjamin Roth
>            Assignee: ZhaoYang
>             Fix For: 4.x
>
>
> I see a potential OOM when a stream (e.g. repair) goes through the write 
> path, as it does with MVs.
> StreamReceiveTask gets a bunch of SSTableReaders. These produce row 
> iterators, which in turn produce mutations. So every partition creates a 
> single mutation, which in the case of (very) big partitions can result in 
> (very) big mutations. Those are created on heap and stay there until they 
> have finished processing.
> I don't think it is necessary to create a single mutation for each partition. 
> Why don't we implement a PartitionUpdateGeneratorIterator that takes an 
> UnfilteredRowIterator and a max size and spits out PartitionUpdates to be 
> used to create and apply mutations?
> The max size should be something like min(reasonable_absolute_max_size, 
> max_mutation_size, commitlog_segment_size / 2). reasonable_absolute_max_size 
> could be something like 16 MB.
> A mutation shouldn't be too large, as it also affects MV partition locking. 
> The longer an MV partition is locked during a stream, the higher the chances 
> that WTEs (write timeouts) occur during streams.
> I could also imagine that a maximum number of updates per mutation, 
> regardless of size in bytes, could make sense to avoid lock contention.
> Love to get feedback and suggestions, including naming suggestions.


