[ https://issues.apache.org/jira/browse/CASSANDRA-13299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16134733#comment-16134733 ]
ZhaoYang edited comment on CASSANDRA-13299 at 8/21/17 11:54 AM:
----------------------------------------------------------------

[trunk|https://github.com/jasonstack/cassandra/commits/CASSANDRA-13299-trunk]
[dtest|https://github.com/riptano/cassandra-dtest/commits/CASSANDRA-13299]

Changes:

1. Throttle by the number of base-table unfiltereds (rows and range tombstone markers). The default batch size is 100.
2. A pair of open/close range tombstone markers can have any number of unshadowed rows between them. In the patch, when a batch reaches its limit while a range tombstone marker is still open, a corresponding close marker is generated for it and the range is re-opened in the next batch (see the sketch below). This avoids handling range tombstone markers separately from rows, which would cost one extra read-before-write per pair of markers. It also helps reduce the impact of a large range tombstone.
3. The partition deletion is applied only on the first mutation, to avoid reading the entire partition more than once.

Note: a single partition deletion or range deletion can cause a huge number of view rows to be removed, so the view mutation may still fail to apply due to a WriteTimeoutException or max_mutation_size; that can be addressed separately in CASSANDRA-12783. Here, I only address the issue of holding an entire partition in memory when repairing a base table with a materialized view.
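To illustrate the batching in changes 1 and 2, here is a minimal, self-contained sketch. It is not the actual patch and does not use Cassandra's internal API: the Unfiltered/RowItem/OpenMarker/CloseMarker types below are simplified stand-ins for base-table unfiltereds.

{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ThrottledBatcher
{
    static final int MAX_UNFILTEREDS_PER_BATCH = 100; // throttle unit; default from the patch

    // Hypothetical stand-ins for base-table unfiltereds (rows and range tombstone bounds).
    interface Unfiltered {}
    static class RowItem implements Unfiltered {}
    static class OpenMarker implements Unfiltered {}   // open bound of a range tombstone
    static class CloseMarker implements Unfiltered {}  // close bound of a range tombstone

    // Splits one partition's unfiltereds into batches of at most
    // MAX_UNFILTEREDS_PER_BATCH. If a range tombstone is still open when a
    // batch fills up, a synthetic close marker is appended so each batch is
    // self-contained, and the range is re-opened at the start of the next
    // batch. Per change 3, a partition-level deletion would be attached to
    // the first batch only (not modeled here).
    static List<List<Unfiltered>> batch(Iterator<Unfiltered> partition)
    {
        List<List<Unfiltered>> batches = new ArrayList<>();
        List<Unfiltered> current = new ArrayList<>();
        boolean markerOpen = false;

        while (partition.hasNext())
        {
            Unfiltered next = partition.next();
            if (next instanceof OpenMarker)
                markerOpen = true;
            else if (next instanceof CloseMarker)
                markerOpen = false;
            current.add(next);

            if (current.size() >= MAX_UNFILTEREDS_PER_BATCH && partition.hasNext())
            {
                if (markerOpen)
                    current.add(new CloseMarker()); // synthesize a close bound for the open range
                batches.add(current);
                current = new ArrayList<>();
                if (markerOpen)
                    current.add(new OpenMarker()); // re-open the range in the next batch
            }
        }
        if (!current.isEmpty())
            batches.add(current);
        return batches;
    }
}
{code}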
> Potential OOMs and lock contention in write path streams
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-13299
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13299
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benjamin Roth
>            Assignee: ZhaoYang
>
> I see a potential OOM when a stream (e.g. repair) goes through the write
> path, as it does with MVs.
> StreamReceiveTask gets a bunch of SSTableReaders. These produce row
> iterators, which in turn produce mutations. So every partition creates a
> single mutation, which in the case of (very) big partitions can result in
> (very) big mutations. Those are created on heap and stay there until they
> have finished processing.
> I don't think it is necessary to create a single mutation for each partition.
> Why don't we implement a PartitionUpdateGeneratorIterator that takes an
> UnfilteredRowIterator and a max size and spits out PartitionUpdates to be
> used to create and apply mutations?
> The max size should be something like min(reasonable_absolute_max_size,
> max_mutation_size, commitlog_segment_size / 2); reasonable_absolute_max_size
> could be something like 16M.
> A mutation shouldn't be too large, as it also affects MV partition locking.
> The longer an MV partition is locked during a stream, the higher the chance
> that WTEs occur during the stream.
> I could also imagine that a max number of updates per mutation, regardless
> of size in bytes, could make sense to avoid lock contention.
> Love to get feedback and suggestions, incl. naming suggestions.
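For reference, the sizing rule proposed in the quoted description could look like the sketch below. The method and parameter names are placeholders, not actual Cassandra config accessors, and 16 MiB stands in for the suggested reasonable_absolute_max_size.

{code:java}
// Hypothetical helper mirroring the proposed cap on a single PartitionUpdate.
static long maxPartitionUpdateSize(long reasonableAbsoluteMaxSize, // e.g. 16 * 1024 * 1024
                                   long maxMutationSize,
                                   long commitLogSegmentSize)
{
    // min(reasonable_absolute_max_size, max_mutation_size, commitlog_segment_size / 2)
    return Math.min(reasonableAbsoluteMaxSize,
                    Math.min(maxMutationSize, commitLogSegmentSize / 2));
}
{code}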