> On April 29, 2014, 7:32 p.m., Martin Kleppmann wrote:
> > samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala,
> >  line 47
> > <https://reviews.apache.org/r/20811/diff/1/?file=569849#file569849line47>
> >
> >     Gut feeling (not backed by any data) is that a threshold of 1000 might 
> > be quite low for a default. If there is a lot of data in the store, the 
> > compaction itself may start taking a long time.
> >     
> >     Rather than an absolute number, how about setting the threshold in 
> > terms of the proportion of keys in the store? e.g. perform a compaction if 
> > more than (say) 20% of the keys in the store have been deleted? That way, 
> > if the store is big (=compaction is expensive), the threshold is 
> > automatically higher.
> >     
> >     (For purposes of that calculation I think it would be fine to simply 
> > count the number of put requests and the number of delete requests -- no 
> > need to track unique keys.)

I can think of two ways in which this approach breaks:

1. People update the same key multiple times, and the put counter is improperly 
incremented.
2. People delete a key that doesn't exist, and the delete counter is improperly 
incremented.

(2) doesn't bother me much, since deleting a nonexistent key is arguably 
erroneous usage, but (1) concerns me. The counting ratio, as you've described 
it, would not work well in cases where people are over-writing existing keys. 
That's actually a pretty common use case, I think. In such a case, we would 
over-count puts, which deflates the percentage, since puts are in the 
denominator (ratio = deletes / puts). Assuming the over-counting is fairly 
predictable, developers could just drop their threshold from 20% to 10% (for 
example).
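To make issue (1) concrete, here is a hypothetical sketch (not Samza's actual classes or API) of the proposed ratio-based trigger. The `Store`, `deleteRatio`, and `shouldCompact` names are invented for illustration; the point is that overwrites inflate the put counter, so the observed ratio understates the true fraction of deleted keys:

```java
// Hypothetical sketch of a ratio-based compaction trigger (not Samza's API).
import java.util.HashMap;
import java.util.Map;

public class RatioCompactionSketch {
    static class Store {
        final Map<String, String> data = new HashMap<>();
        long puts = 0;      // incremented on every put, even overwrites
        long deletes = 0;   // incremented on every delete, even misses
        final double threshold; // e.g. 0.20 == compact at 20% deletes

        Store(double threshold) { this.threshold = threshold; }

        void put(String k, String v) { data.put(k, v); puts++; }
        void delete(String k) { data.remove(k); deletes++; }

        // deletes / puts: overwrites grow the denominator, so the observed
        // ratio understates the true fraction of deleted keys.
        double deleteRatio() { return puts == 0 ? 0.0 : (double) deletes / puts; }
        boolean shouldCompact() { return deleteRatio() > threshold; }
    }

    public static void main(String[] args) {
        Store s = new Store(0.20);
        // 100 distinct keys, each overwritten 10 times -> 1000 counted puts
        for (int i = 0; i < 10; i++)
            for (int k = 0; k < 100; k++)
                s.put("key" + k, "v" + i);
        // delete 30 of the 100 keys: 30% of keys are truly deleted...
        for (int k = 0; k < 30; k++) s.delete("key" + k);
        // ...but the counted ratio is only 30/1000 = 3%, below the 20% bar
        System.out.println(s.deleteRatio());   // 0.03
        System.out.println(s.shouldCompact()); // false
    }
}
```

Here 30% of the keys really are tombstones, yet the counted ratio never crosses the 20% threshold, so compaction would never fire.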

Another thing to consider is how easy this config is to use and describe. A 
percentage-based threshold is a bit more confusing, especially when it's 
inaccurate and leads to oddities like (1) and (2), above.

Regarding 1000 being too low: in cases where the cache is used (and I've never 
seen it not be used), the threshold is somewhat misleading, because the cache 
absorbs deletes, which prevents the delete counter in the LevelDB store from 
incrementing. This means you can issue over 1000 deletes and still not trigger 
a compaction. For that reason, I'm going to raise the default threshold to 
equal the cache size. This ensures that if you execute >= deleteThreshold 
deletes, then call get(), your deletes will be flushed to the LevelDB store.
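The cache interaction can be sketched as follows. This is a hypothetical model (the `CountingStore` and `WriteBackCache` names and their behavior are invented for illustration, not the actual Samza classes): a write-back cache absorbs deletes until it fills, so a store-side delete counter below the cache size may never reach its threshold:

```java
// Hypothetical sketch: a write-back cache in front of a store that counts
// deletes, showing why a delete threshold below the cache size may never
// fire -- cached deletes aren't seen by the store until a flush.
import java.util.LinkedHashMap;
import java.util.Map;

public class CachedDeletesSketch {
    static class CountingStore {
        final Map<String, String> data = new LinkedHashMap<>();
        long deletesSeen = 0; // what a store-level delete counter observes
        void put(String k, String v) { data.put(k, v); }
        void delete(String k) { data.remove(k); deletesSeen++; }
    }

    static class WriteBackCache {
        final CountingStore store;
        final int capacity;
        // a null value marks a cached (not yet flushed) delete
        final Map<String, String> dirty = new LinkedHashMap<>();

        WriteBackCache(CountingStore store, int capacity) {
            this.store = store;
            this.capacity = capacity;
        }

        void delete(String k) {
            dirty.put(k, null);                    // absorbed by the cache...
            if (dirty.size() >= capacity) flush(); // ...until the cache fills
        }

        void flush() {
            for (Map.Entry<String, String> e : dirty.entrySet()) {
                if (e.getValue() == null) store.delete(e.getKey());
                else store.put(e.getKey(), e.getValue());
            }
            dirty.clear();
        }
    }

    public static void main(String[] args) {
        CountingStore store = new CountingStore();
        WriteBackCache cache = new WriteBackCache(store, 1000);
        // 999 deletes: a store-side threshold of, say, 500 never fires,
        // because the store's counter is still zero.
        for (int i = 0; i < 999; i++) cache.delete("key" + i);
        System.out.println(store.deletesSeen); // 0
        cache.delete("key999"); // 1000th delete fills the cache -> flush
        System.out.println(store.deletesSeen); // 1000
    }
}
```

Setting the default delete threshold equal to the cache size sidesteps this: by the time the counter could reach the threshold, the cache must have flushed its deletes to the store.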


- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20811/#review41764
-----------------------------------------------------------


On April 28, 2014, 10:52 p.m., Chris Riccomini wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20811/
> -----------------------------------------------------------
> 
> (Updated April 28, 2014, 10:52 p.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> add javadocs, and reset deletion counter in compact
> 
> 
> make delete threshold configurable. add a performance test (takes 25s to run).
> 
> 
> make compaction lazy on read-side so we can take advantage of cached writes
> 
> 
> trigger compactions periodically to remove deleted keys from levels
> 
> 
> Diffs
> -----
> 
>   
> samza-kv/src/main/scala/org/apache/samza/storage/kv/KeyValueStorageEngineFactory.scala
>  81fe86165019f72a15be1ac9cfcfff0598b4b92b 
>   
> samza-kv/src/main/scala/org/apache/samza/storage/kv/LevelDbKeyValueStore.scala
>  8602a328673e6fa7d435366abcd9a96a99d9cd88 
>   
> samza-kv/src/test/scala/org/apache/samza/storage/kv/TestKeyValueStores.scala 
> 85ba11a3362ad7cf4f84fbcbd944cd790e572cbe 
> 
> Diff: https://reviews.apache.org/r/20811/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Chris Riccomini
> 
>
