[ https://issues.apache.org/jira/browse/CASSANDRA-13299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16180923#comment-16180923 ]
Paulo Motta commented on CASSANDRA-13299:
-----------------------------------------

Thanks for the updates, this is looking good! I managed to reproduce an OOM when repairing a wide partition with 100K rows and verified that this patch avoids the OOM by splitting the partition into multiple batches (and found CASSANDRA-13899 along the way). Awesome job!

While splitting works nicely in the happy case, we need to ensure the range tombstone handling (especially range deletions) is correct and well tested before committing this. I noticed that the previous {{ThrottledUnfilteredIterator}} implementation could [potentially return|https://github.com/jasonstack/cassandra/blob/b8cb49035d3cf77198b31df7c51a174fffe3edaf/src/java/org/apache/cassandra/db/rows/ThrottledUnfilteredIterator.java#L166] {{throttle+2}} unfiltereds, contrary to the documentation, which states that the maximum number of unfiltereds per batch is {{throttle+1}}. I also noticed that the case where there is a row between two markers was not covered by the existing tests, since a range deletion is needed to reproduce that scenario. I fixed this and added more tests in [this commit|https://github.com/pauloricardomg/cassandra/commit/47d8ca3592cb6382bb4c308720646395306a0a69]. Could you also modify [complexThrottleWithTombstoneTest|https://github.com/jasonstack/cassandra/commit/b8cb49035d3cf77198b31df7c51a174fffe3edaf#diff-5162644c24391628b339b88c3619427cR66] to test range deletions?

The previous change [requires|https://github.com/pauloricardomg/cassandra/commit/47d8ca3592cb6382bb4c308720646395306a0a69#diff-2acee8fea5cd82a51fda4af6e38faf13R60] a minimum throttle size of 2; otherwise it would not be possible to make progress on the iterator in the presence of open and close markers.

I think that instead of throwing an {{AssertionError}} when the returned iterator is not exhausted, we could simply exhaust it, effectively skipping entries, since this might be a legitimate usage of {{ThrottledUnfilteredIterator}}, so I did this in [this commit|https://github.com/pauloricardomg/cassandra/commit/04ed5ecb5183195601950fc9efd2ca9123596487]. I also added a utility method {{ThrottledUnfilteredIterator.throttle(UnfilteredPartitionIterator partitionIterator, int maxBatchSize)}} to allow throttling an {{UnfilteredPartitionIterator}} transparently, and used it in {{StreamReceiveTask}} [in this commit|https://github.com/pauloricardomg/cassandra/commit/4f8c3b8faa2644133d301ac7bf7b748f7ec265ee].

I had another look at the {{throttled_partition_update_test}} [dtest|https://github.com/riptano/cassandra-dtest/commit/f3307adef349f232ec0ae64e902164684f32cca0] and think we can make the following improvements:
* Right now we're [verifying the results|https://github.com/riptano/cassandra-dtest/commit/f3307adef349f232ec0ae64e902164684f32cca0#diff-62ba429edee6a4681782f078246c9893R1410] with all the nodes UP, but it's possible that another node answers the query even though one of the inconsistent nodes did not stream correctly. I think we should check the results on each node individually (with the others down) to ensure each one streamed data correctly from the other nodes.
* Add [range deletions|https://issues.apache.org/jira/browse/CASSANDRA-6237], since that's when the range tombstone special cases will actually be exercised.

Please let me know what you think about these suggestions.
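To make the {{throttle+1}} limit and the minimum throttle of 2 concrete, here is a simplified, self-contained sketch of the batching logic; the {{Row}}/{{OpenMarker}}/{{CloseMarker}} types are hypothetical stand-ins, not the real {{Unfiltered}}/{{RangeTombstoneMarker}} classes, and real boundary markers (close+open in one) are not modeled:

{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ThrottleSketch
{
    interface Unfiltered {}
    record Row(String clustering) implements Unfiltered {}
    record OpenMarker(String bound) implements Unfiltered {}
    record CloseMarker(String bound) implements Unfiltered {}

    static List<List<Unfiltered>> throttle(Iterator<Unfiltered> source, int throttle)
    {
        // With throttle < 2 a batch could contain only the re-opened marker
        // and never advance past it, hence the minimum of 2.
        if (throttle < 2)
            throw new IllegalArgumentException("throttle must be >= 2");

        List<List<Unfiltered>> batches = new ArrayList<>();
        List<Unfiltered> batch = new ArrayList<>();
        OpenMarker openDeletion = null; // range deletion still open at a batch boundary

        while (source.hasNext())
        {
            if (batch.isEmpty() && openDeletion != null)
                batch.add(openDeletion); // re-open the deletion carried over from the previous batch

            Unfiltered next = source.next();
            if (next instanceof OpenMarker om)
                openDeletion = om;
            else if (next instanceof CloseMarker)
                openDeletion = null;
            batch.add(next);

            if (batch.size() >= throttle)
            {
                // If a range deletion spans the boundary, close it artificially:
                // this is the (throttle+1)-th unfiltered allowed in a batch. The
                // real code would derive the close bound from the boundary clustering.
                if (openDeletion != null && source.hasNext())
                    batch.add(new CloseMarker(openDeletion.bound()));
                batches.add(batch);
                batch = new ArrayList<>();
            }
        }
        if (!batch.isEmpty())
            batches.add(batch);
        return batches;
    }

    public static void main(String[] args)
    {
        List<Unfiltered> partition = List.of(
            new OpenMarker("[a"), new Row("b"), new Row("c"),
            new Row("d"), new CloseMarker("e]"));
        // Each printed batch is self-contained, with matched open/close markers.
        throttle(partition.iterator(), 2).forEach(System.out::println);
    }
}
{code}

Run against a single open..close range with {{throttle=2}}, every emitted batch is a valid sequence with matched markers; with {{throttle=1}} a batch would hold nothing but the re-opened marker and never consume a row, which is exactly the progress problem the minimum of 2 avoids.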
> Potential OOMs and lock contention in write path streams
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-13299
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13299
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Materialized Views
>            Reporter: Benjamin Roth
>            Assignee: ZhaoYang
>             Fix For: 4.x
>
> I see a potential OOM when a stream (e.g. repair) goes through the write path, as it does with MVs.
> StreamReceiveTask gets a bunch of SSTableReaders. These produce row iterators, which in turn produce mutations. So every partition creates a single mutation, which in the case of (very) big partitions can result in (very) big mutations. Those are created on heap and stay there until they have finished processing.
> I don't think it is necessary to create a single mutation for each partition. Why don't we implement a PartitionUpdateGeneratorIterator that takes an UnfilteredRowIterator and a max size and spits out PartitionUpdates to be used to create and apply mutations?
> The max size should be something like min(reasonable_absolute_max_size, max_mutation_size, commitlog_segment_size / 2). reasonable_absolute_max_size could be something like 16M.
> A mutation shouldn't be too large, as it also affects MV partition locking. The longer an MV partition is locked during a stream, the higher the chance that WTEs occur during streams.
> I could also imagine that a max number of updates per mutation, regardless of size in bytes, could make sense to avoid lock contention.
> Love to get feedback and suggestions, incl. naming suggestions.
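The description above proposes capping each generated PartitionUpdate at {{min(reasonable_absolute_max_size, max_mutation_size, commitlog_segment_size / 2)}}. A minimal sketch of that cap; the parameter names mirror the reporter's placeholders and are not actual Cassandra configuration options:

{code:java}
public final class MutationSizeCap
{
    // min(reasonable_absolute_max_size, max_mutation_size, commitlog_segment_size / 2),
    // as proposed in the issue description above.
    static long maxMutationBytes(long reasonableAbsoluteMaxSize,
                                 long maxMutationSize,
                                 long commitlogSegmentSize)
    {
        return Math.min(reasonableAbsoluteMaxSize,
                        Math.min(maxMutationSize, commitlogSegmentSize / 2));
    }

    public static void main(String[] args)
    {
        // With a 32 MiB commitlog segment and a 16 MiB absolute cap,
        // the effective limit is min(16 MiB, 16 MiB, 16 MiB) = 16 MiB.
        long cap = maxMutationBytes(16L << 20, 16L << 20, 32L << 20);
        System.out.println(cap); // 16777216
    }
}
{code}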