> On April 29, 2014, 7:32 p.m., Martin Kleppmann wrote:
> > samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala, line 47
> > <https://reviews.apache.org/r/20811/diff/1/?file=569849#file569849line47>
> >
> > Gut feeling (not backed by any data) is that a threshold of 1000 might be quite low for a default. If there is a lot of data in the store, the compaction itself may start taking a long time.
> >
> > Rather than an absolute number, how about setting the threshold in terms of the proportion of keys in the store? e.g. perform a compaction if more than (say) 20% of the keys in the store have been deleted? That way, if the store is big (= compaction is expensive), the threshold is automatically higher.
> >
> > (For purposes of that calculation I think it would be fine to simply count the number of put requests and the number of delete requests -- no need to track unique keys.)
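The ratio-based trigger Martin proposes could be sketched roughly as below. This is a hypothetical illustration, not code from the Samza patch; the class and method names (`RatioCompactionPolicy`, `onPut`, `onDelete`, `shouldCompact`) are invented for the example, and it uses the deletes/puts ratio that the reply below attributes to the proposal. Note the comments on `onPut`/`onDelete`: they count every request, which is exactly where the over-counting concern in the reply comes from.

```java
// Hypothetical sketch of a ratio-based compaction trigger; names are
// illustrative and not part of the actual Samza key-value store code.
class RatioCompactionPolicy {
    private long puts = 0;
    private long deletes = 0;
    private final double threshold; // e.g. 0.20 for "20% of keys deleted"

    RatioCompactionPolicy(double threshold) {
        this.threshold = threshold;
    }

    // Counts every put request, including overwrites of existing keys --
    // overwrites inflate the denominator and lower the observed ratio.
    void onPut() { puts++; }

    // Counts every delete request, including deletes of absent keys.
    void onDelete() { deletes++; }

    // Compact once deletes reach the configured fraction of puts.
    boolean shouldCompact() {
        return puts > 0 && (double) deletes / puts >= threshold;
    }

    // Reset counters after a compaction so the ratio starts fresh.
    void onCompact() {
        puts = 0;
        deletes = 0;
    }
}
```

Because only request counts are tracked (no unique keys), the policy needs no extra memory, but its accuracy degrades under heavy overwrite workloads.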
I can think of two ways in which this approach breaks:

1. People update the same key multiple times, and the put counter is improperly incremented.
2. People delete a key that doesn't exist, and the delete counter is improperly incremented.

(2) doesn't bother me much, since it's kind of erroneous, but (1) concerns me. Using the counting ratio as you've described would not work well in cases where people are over-writing existing keys, which I think is actually a pretty common use case. In such a case, we would over-count puts, and the percentage would drop, since puts are in the denominator (deletes / puts = %). Assuming the over-counting is fairly predictable, developers could just drop their threshold from 20% to 10% (for example).

Another thing to consider is how easy this config is to use and describe. I think a %-based threshold is a bit more confusing, especially if it's not accurate and leads to weird issues like (1) and (2) above.

Regarding 1000 being too low: in cases where the cache is used (and I've never seen it not be used), the threshold is somewhat misleading, since the cache will cache deletes, which prevents the delete counter from incrementing in the LevelDB store. This means you can have over 1000 deletes but still not trigger a compaction. For that reason, I'm going to raise the default threshold to be equal to the cache size. This ensures that if you execute >= deleteThreshold deletes and then call get(), your deletes will be flushed to the LevelDB store.

- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/#review41764
-----------------------------------------------------------


On April 28, 2014, 10:52 p.m., Chris Riccomini wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail.
> To reply, visit:
> https://reviews.apache.org/r/20811/
> -----------------------------------------------------------
> 
> (Updated April 28, 2014, 10:52 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> add javadocs, and reset deletion counter in compact
> 
> make delete threshold configurable. add a performance test (takes 25s to run).
> 
> make compaction lazy on read-side so we can take advantage of cached writes
> 
> trigger compactions periodically to remove deleted keys from levels
> 
> 
> Diffs
> -----
> 
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala 81fe86165019f72a15be1ac9cfcfff0598b4b92b
>   samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala 8602a328673e6fa7d435366abcd9a96a99d9cd88
>   samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe
> 
> Diff: https://reviews.apache.org/r/20811/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chris Riccomini
> 
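As an editorial aside on the point above about cached deletes masking the store-level counter: the interaction can be illustrated with the minimal write-back cache below. This is a hypothetical sketch, not the actual Samza CachedStore or LevelDbKeyValueStore code; all class and field names (`CountingStore`, `WriteBackCache`, `deleteCount`) are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A store that counts deletes, standing in for the LevelDB-backed store
// whose delete counter drives compaction. Names are hypothetical.
class CountingStore {
    final Map<String, String> data = new LinkedHashMap<>();
    long deleteCount = 0;

    void put(String key, String value) { data.put(key, value); }
    void delete(String key) { data.remove(key); deleteCount++; }
}

// A write-back cache in front of the store. Deletes are buffered as
// null-valued dirty entries and only reach the store (and its delete
// counter) on flush -- so the counter lags behind the deletes issued.
class WriteBackCache {
    private final CountingStore store;
    private final Map<String, String> dirty = new LinkedHashMap<>();

    WriteBackCache(CountingStore store) { this.store = store; }

    void put(String key, String value) { dirty.put(key, value); }

    // Buffered delete: the store's counter does not move yet.
    void delete(String key) { dirty.put(key, null); }

    // Apply buffered writes and deletes to the underlying store.
    void flush() {
        for (Map.Entry<String, String> e : dirty.entrySet()) {
            if (e.getValue() == null) store.delete(e.getKey());
            else store.put(e.getKey(), e.getValue());
        }
        dirty.clear();
    }
}
```

Under this structure, a delete threshold smaller than the cache size could never fire between flushes, which matches the rationale for defaulting the threshold to the cache size.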
