Hi,

I have recently tried out s3ql on Debian testing, and I have a few questions.

I'm using s3ql with local storage, without encryption or compression.  I set 
threads to 1 as a baseline.

I'm pretty confused about what I'm experiencing.  First, I don't understand 
the need for (or use of) a local cache when the backend is already a local 
filesystem.  I'd prefer not to store the data locally more than once, even 
temporarily.  I see that files are populated into ~/.s3ql/local*cache/ and 
that they are roughly the same size as the files I'm writing into the actual 
mounted filesystem.  For a local filesystem backend, I'd like a mode where no 
cache is used, even if that means writes have to be synchronous.  Apart from 
this request, I'd also like to understand what's going on under the hood and 
whether there are any parameters I can tweak when I'm using the local backend.
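For reference, this is roughly how I set things up (paths are placeholders for 
my actual directories; flag names as I understand them from the mkfs.s3ql and 
mount.s3ql manpages):

```shell
# Local backend, no encryption (--plain); placeholder paths.
mkfs.s3ql --plain local:///srv/s3ql-data

# --cachesize is in KiB; this is the knob I've been varying.
mount.s3ql --threads 1 --compress none --cachesize 1048576 \
    local:///srv/s3ql-data /mnt/s3ql
```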

I find that when I manually set cachesize to a small or zero value, my write 
throughput drops by several orders of magnitude.  Is running with no cache 
unsupported?  Does it introduce some kind of deadlock or starvation?  I 
understand if it's not supported or intended use for now, in which case I'll 
leave the cache at its default, but it might be a useful test to see how the 
system behaves when the cache is small or nonexistent.  If I choose a smaller 
but still nonzero cache, I'd want to be sure I'm not running into an arbitrary 
limitation.  I don't mind a small performance loss, but with a zero cache size 
I get throughput of around 50 kilobytes per second, which suggests I'm hitting 
an unexpected code path.  Read performance is fine even in that case.

The next thing I'm wondering about is the deduplication.  In my test, I'm 
writing all zeroes.  Using dd, I write a megabyte as a single 1MB block, then 
as 1024 blocks of 1KB each, and then I also write 2MB or 4MB at a time.  I'd 
expect deduplication to catch these very trivial cases, so that I'd only see 
one entry of at most 2^n bytes, where 2^n is the approximate block size of the 
deduplication.  I'd also expect 2^n to be smaller than a megabyte (maybe a 
single 64k block).  Is there a reason deduplication treats this data as 
different or unique?  I'd like to understand what's going on internally.  I'd 
like to be able to write any arbitrary configuration of zeroes of size 2^m and 
still see at most one entry of size 2^n in the database, consisting of one 
block's worth of zeroes, even if m >> n.  As long as that's not happening, I'm 
a little confused about how the deduplication works, especially when the input 
is inherently duplicate.  I understand compression would bring the size down 
considerably, but it would mask the issue I'm describing here.
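Concretely, my test looks roughly like this (I'm writing to /tmp here for 
illustration and the filenames are made up; in the real test the output files 
live on the s3ql mount):

```shell
# Write the same 1 MiB of zeroes two ways: the resulting files are
# byte-identical, so I'd expect dedup to store that data only once.
dd if=/dev/zero of=/tmp/one-big.img    bs=1M count=1    2>/dev/null
dd if=/dev/zero of=/tmp/many-small.img bs=1K count=1024 2>/dev/null

# Larger all-zero write (2 MiB here; the other run uses 4 MiB).
dd if=/dev/zero of=/tmp/bigger.img bs=2M count=1 2>/dev/null

# Identical digests for the two 1 MiB files confirm the inputs match.
sha256sum /tmp/one-big.img /tmp/many-small.img
```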

Mike

-- 
You received this message because you are subscribed to the Google Groups 
"s3ql" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
