[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071538#comment-14071538 ]
Benedict edited comment on CASSANDRA-7546 at 7/23/14 9:14 AM:
--------------------------------------------------------------

It doesn't look to me like we re-copy the ranges (only the arrays we store them in).

On your patch, I have a couple of minor concerns:

* I would rather we didn't increase the amount of memory we use. I'm stricter about this in 2.1, because in 2.0 we can mitigate it by replacing the AtomicReference with a volatile field and an AtomicReferenceFieldUpdater. But whatever we do in 2.1 has to be free memory-wise. This means we have 1 integer or 1 reference to play with in the outer class (not the holder), as we can get this for free. We don't need to maintain a size in 2.1, though, so this is easy: we can track the actual amount of memory allocated (since we already do this).
* I would rather make the condition for upgrading to locks be based on some rate of wasted work (or, since it works just as well, some rate of wasted memory allocation). The current value seems a bit clunky and difficult to tune, and might be no real indication of contention. However, we need to keep this encoded in a single integer, and we need to ensure it is free to maintain in the fast case.
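A minimal sketch of how the single-integer scheme proposed below might look, assuming hypothetical names (WasteTracker, ALLOWANCE_KB, reportWasted, wastedKb) and the 2.0 cost estimate from the footnote; this is an illustration of the proposal, not Cassandra's actual code:

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

class WasteTracker
{
    // Allowance: 100MB/s expressed in KB, as in step 1 of the proposal
    private static final int ALLOWANCE_KB = 100 * 1024;
    // Flag value telling everyone to upgrade to locks
    private static final int PESSIMISTIC = Integer.MAX_VALUE;
    private static final int NEVER_WASTED = 0;

    // The one free integer in the outer class; a volatile field plus a static
    // field updater costs no per-instance memory beyond the int itself
    private volatile int tracker = NEVER_WASTED;
    private static final AtomicIntegerFieldUpdater<WasteTracker> trackerUpdater =
            AtomicIntegerFieldUpdater.newUpdater(WasteTracker.class, "tracker");

    boolean shouldUpgradeToLocks()
    {
        return tracker == PESSIMISTIC;
    }

    // Called only after a failed CAS whose allocations were wasted
    void reportWasted(int wastedKb)
    {
        while (true)
        {
            int cur = tracker;
            if (cur == PESSIMISTIC)
                return; // already flagged; nothing more to track

            // ms clock derived from nanoTime for monotonicity; int wrap-around
            // is tolerated by using subtraction for all comparisons
            int nowMs = (int) TimeUnit.NANOSECONDS.toMillis(System.nanoTime());

            int next = (cur == NEVER_WASTED || nowMs - cur > ALLOWANCE_KB)
                     ? nowMs - ALLOWANCE_KB // first waste, or too far in the past: reset
                     : cur;
            next += wastedKb;

            if (next - nowMs >= 0)      // counter has gone past the present time
                next = PESSIMISTIC;     // exceeded our maximum wastage: flag lock upgrade

            if (trackerUpdater.compareAndSet(this, cur, next))
                return;
            // else another thread raced us: re-read and retry
        }
    }

    // 2.0 waste estimate from the footnote: updates * lg2(N) * ~100 bytes per
    // snaptree node, plus ~200 bytes per clone, converted to KB
    static int wastedKb(int updates, int treeSize, int clones)
    {
        int lg2 = 32 - Integer.numberOfLeadingZeros(Math.max(1, treeSize));
        return (updates * lg2 * 100 + clones * 200) / 1024;
    }
}
{code}

A real version would also need to keep the computed value off the two reserved states (0 and Integer.MAX_VALUE), e.g. by nudging it by one when it collides.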
So I propose the following (see the sketch above):
# we decide on a maximum rate of waste (let's say 100MB/s)
# when we first waste work, we:
#* get the current time in ms (but derived from nanoTime, since we need monotonicity);
#* subtract from it our max rate (100MB/s) converted to KB/s, i.e. 100 * 1024, so we have present - 100*1024;
#* set our shared counter state to this value
# whenever we waste work, we:
#* calculate how much we wasted\* in KB;
#* add this to our shared counter;
#* if the shared counter has _gone past the present time_, we know we've exceeded our maximum wastage, and we set our counter to Integer.MAX_VALUE, which is the flag to everyone to upgrade to locks;
#* if we see it's too far in the past, we reset it to present - (100*1024)

\* To calculate wasted work, we track the size you are currently tracking in 2.0, and in 2.1 we use the BTree's existing size-delta tracking. In 2.0 we multiply the number of updates we made by lg2(N) (N = current tree size), multiply this by 100 (the approximate size of snaptree nodes), and add ~200 per clone. We divide the result by 1024.

This is the same scheme I used for tracking wasted cycles in SharedExecutorPool (CASSANDRA-4718); I think it works pretty well, and it is succinctly represented in memory.

> AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
> -----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7546
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: graham sanderson
>            Assignee: graham sanderson
>         Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_3.txt, 7546.20_alt.txt, suggestion1.txt, suggestion1_21.txt
>
> In order to preserve atomicity, this code attempts to read, clone/update, then CAS the state of the partition (sketched below).
> Under heavy contention for updating a single partition, this can cause some fairly staggering memory growth (the more cores on your machine, the worse it gets).
> Whilst many usage patterns don't do highly concurrent updates to the same partition, hinting today does, and in this case wild (order(s) of magnitude more than expected) memory allocation rates can be seen (especially when the updates being hinted are small updates to different partitions, which can happen very fast on their own) - see CASSANDRA-7545.
> It would be best to eliminate/reduce/limit the spinning memory allocation whilst not slowing down the very common un-contended case.
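For reference, this is the general shape of the read/clone/CAS loop the description refers to; a hypothetical reduction for illustration, not the actual addAllWithSizeDelta:

{code:java}
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

class AtomicHolder<T>
{
    private final AtomicReference<T> ref;

    AtomicHolder(T initial) { ref = new AtomicReference<>(initial); }

    // Read, clone/update, CAS: every failed CAS discards a freshly allocated
    // copy, so under contention allocation grows with the retry count
    T update(UnaryOperator<T> cloneAndUpdate)
    {
        while (true)
        {
            T current = ref.get();
            T modified = cloneAndUpdate.apply(current); // allocation happens here
            if (ref.compareAndSet(current, modified))
                return modified;                        // success: one useful copy
            // failure: 'modified' becomes garbage; loop and allocate again
        }
    }
}
{code}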