> On April 29, 2014, 7:32 p.m., Martin Kleppmann wrote: > > samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala, > > line 47 > > <https://reviews.apache.org/r/20811/diff/1/?file=569849#file569849line47> > > > > Gut feeling (not backed by any data) is that a threshold of 1000 might > > be quite low for a default. If there is a lot of data in the store, the > > compaction itself may start taking a long time. > > > > Rather than an absolute number, how about setting the threshold in > > terms of the proportion of keys in the keys? e.g. perform a compaction if > > more than (say) 20% of the keys in the store have been deleted? That way, > > if the store is big (=compaction is expensive), the threshold is > > automatically higher. > > > > (For purposes of that calculation I think it would be fine to simply > > count number of put requests and number of delete requests -- no need to > > track unique keys.) > > Martin Kleppmann wrote: > s/keys in the keys/keys in the store/ > > Chris Riccomini wrote: > I can think of two ways in which this approach breaks: > > 1. People update the same key multiple times, and the put counter is > improperly incremented. > 2. People delete a key that doesn't exist, and the delete counter is > improperly incremented. > > (2) doesn't bother me much, since it's kind of erroneous, but (1) > concerns me. Using the counting ratio, as you've described, would not work > well in cases where people are over-writing existing keys. This is actually a > pretty common use case, I think. In such a case, we would over-count puts, > and would therefore drop our percentage, since the puts are in the > denominator (deletes / puts = %). Assuming the over-counting is pretty > predictable, developers could just drop their threshold from 20% to 10% (for > example). > > Another thing to consider is how easy this config is to use and describe. > I think it's a bit more confusing to have a % based threshold, especially if > it's not accurate, and leads to weird issues like (1) and (2), above. > > Regarding 1000 being too low, in cases where the cache is used (and I've > never seen it not be used), the threshold is somewhat misleading, since the > cache will cache deletes, which prevents the delete counter from incrementing > in the LevelDB store. This means you can have over 1000 deletes, but still > not trigger a compaction. For that reason, I'm going to raise the threshold > to be equal to the cache size by default. This will ensure that if you > execute >= deleteThreshold deletes, then call get(), your deletes will be > flushed to the LevelDB store.
Ok, makes sense. Also, it's good that the compaction doesn't happen on the delete itself, but on the subsequent get. So if you have a million keys in the store, and delete half of them in one go, it will still only get compacted once, no matter what the threshold is set to. So that sounds fine to me. - Martin ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/20811/#review41764 ----------------------------------------------------------- On April 29, 2014, 10:37 p.m., Chris Riccomini wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/20811/ > ----------------------------------------------------------- > > (Updated April 29, 2014, 10:37 p.m.) > > > Review request for samza. > > > Repository: samza > > > Description > ------- > > add license header to new files > > > updating perf test to use delete threshold setting > > > disable this feature by default > > > default the delete compaction threshold to cache size > > > move performance test into samza-test > > > add javadocs, and reset deletion counter in compact > > > make delete threshold configurable. add a performance test (takes 25s to run). > > > make compaction lazy on read-side so we can take advantage of cached writes > > > trigger compactions periodically to remove deleted keys from levels > > > Diffs > ----- > > README.md a930cf020c46a616f4c7319881dbff3043b0fd49 > build.gradle 65d95d4c6eb220b617dab6ea5772c9062ab146bd > > samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala > 81fe86165019f72a15be1ac9cfcfff0598b4b92b > > samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala > 8602a328673e6fa7d435366abcd9a96a99d9cd88 > > samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala > 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe > samza-test/src/main/resources/perf/kv-perf.properties PRE-CREATION > > samza-test/src/main/scala/org/apache/samza/test/performance/TestKeyValuePerformance.scala > PRE-CREATION > > Diff: https://reviews.apache.org/r/20811/diff/ > > > Testing > ------- > > > Thanks, > > Chris Riccomini > >
