Re: [MirageOS-devel] irmin storage overhead and dedup

Gregory Tsipenyuk Wed, 31 Dec 2014 06:48:36 -0800

I ran all tests on empty repositories.

Would it make sense to have a “benchmark” folder under Irmin to check the tests into?
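
For reference, here is a rough sketch of the arithmetic behind the overhead figure in the quoted message below. It assumes decimal units (1G = 10^9 bytes), which the thread does not state, and the names are illustrative only.

```python
# Figures taken from the thread; decimal units (1G = 10**9 bytes) are an assumption.
REPO_SIZE_BYTES = 4_700_000_000   # 4.7G repository after the test
MESSAGE_COUNT = 20_000            # unique messages appended
PAYLOAD_BYTES = 100               # size of each message


def per_message_overhead(repo_size: int, count: int, payload: int) -> float:
    """Average bytes stored per message beyond the message payload itself."""
    return repo_size / count - payload


overhead = per_message_overhead(REPO_SIZE_BYTES, MESSAGE_COUNT, PAYLOAD_BYTES)
print(f"{overhead / 1000:.1f} KB of overhead per message")
print(f"{overhead / PAYLOAD_BYTES:.0f}x the payload size")
```

Under that assumption the result is a little under 235 KB per message, in line with the "roughly 234K" quoted below.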


> On Dec 31, 2014, at 5:32 AM, Thomas Gazagnaire <[email protected]> wrote:
> 
>> I looked at the metadata that gets created for every email message and it’s 
>> small - less than 100 bytes. So I ran a simple test of appending 20,000 
>> unique 100-byte ASCII messages. I would have expected the repository size 
>> to be on the order of a few megabytes; instead it was 4.7G. This is roughly 
>> 234K of overhead per 100-byte message, which would be quite impractical for 
>> email storage, with the metadata far exceeding the message storage 
>> itself.
> 
> Did you start from an empty repository? I would be interested in running 
> your code locally to check what happens. 
> 
> More generally, all the benchmarks/experiments you are running are very 
> useful. It would be nice to put them somewhere online and turn them into 
> functional tests that run regularly, to check that the serialisation format 
> doesn't go crazy.
> 
> Thanks!
> Thomas
> 
> 
> 
> 
>> 
>> Gregory
>> 
>>> On Dec 30, 2014, at 7:07 PM, Gregory Tsipenyuk <[email protected]> wrote:
>>> 
>>> Hi Thomas,
>>> 
>>> I’m trying to figure out what kind of storage overhead and dedup I get in 
>>> Irmin. First I tried to convert the Google email archive (2.4G) to the IMAP 
>>> server's Irmin format. After conversion the size of the git repository was 
>>> twice the size of the original archive. I do create some additional 
>>> structures, like a per-mailbox index, summary statistics and per-message 
>>> flags, so perhaps the extra size comes from those structures, though it 
>>> seems a bit high. I will have to estimate the expected size of the 
>>> additional structures to understand this result.
>>> 
>>> Next I dumped 2,000 1M files with random ASCII content into Irmin, which 
>>> resulted in a git repository size of 950M. I figure Irmin compresses the 
>>> content, right? To verify this I dumped 2,000 2.4M image files, each with 
>>> a counter concatenated to make the content unique. The size of the 
>>> repository for this was 4.6G, which is expected. Then I repeated the last 
>>> test but with identical images, and this time the size was 27M, which was 
>>> clearly a nice proof of the deduping by Irmin.
>>> 
>>> My question is whether the compression in Irmin is configurable. Can it be 
>>> configured per individual content item? For instance, I don’t want to 
>>> compress images, as there is no space saving to gain and so the resource 
>>> usage is unnecessary, but I do want to compress text if the compression 
>>> overhead is reasonable. I can figure out the type of content from the MIME 
>>> type in the IMAP server.
>>> 
>>> Thanks 
>>> Gregory
>> 
> 
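
To make the dedup and per-content compression question above concrete, here is a minimal sketch of a toy content-addressed store. It is not Irmin's or git's implementation: `ToyStore`, `COMPRESSIBLE_PREFIXES` and the one-byte tag are hypothetical names and choices made up for illustration. Identical content hashes to the same key and is stored once, and compression is applied only to MIME types assumed to be compressible.

```python
import hashlib
import zlib

# Hypothetical policy: compress only MIME types that usually shrink.
COMPRESSIBLE_PREFIXES = ("text/", "message/")


class ToyStore:
    """Toy content-addressed store, for illustration only."""

    def __init__(self) -> None:
        # SHA-1 hex digest of the raw content -> stored (tagged) bytes
        self.objects: dict[str, bytes] = {}

    def put(self, data: bytes, mime_type: str) -> str:
        key = hashlib.sha1(data).hexdigest()
        if key not in self.objects:  # identical content is stored only once
            if mime_type.startswith(COMPRESSIBLE_PREFIXES):
                stored = b"z" + zlib.compress(data)  # "z" tag: zlib-compressed
            else:
                stored = b"r" + data                 # "r" tag: raw bytes
            self.objects[key] = stored
        return key

    def get(self, key: str) -> bytes:
        stored = self.objects[key]
        tag, body = stored[:1], stored[1:]
        return zlib.decompress(body) if tag == b"z" else body

    def size(self) -> int:
        """Total bytes held by the store, including the one-byte tags."""
        return sum(len(v) for v in self.objects.values())


store = ToyStore()
image = bytes(range(256)) * 100          # stand-in for an image payload
for _ in range(2000):
    store.put(image, "image/jpeg")       # same content every time
print(len(store.objects), "object(s) stored for 2,000 identical images")
```

Appending a counter to each payload, as in the unique-image test, changes the hash, so every put would then create a new object and the store would grow linearly.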


_______________________________________________
MirageOS-devel mailing list
[email protected]
http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel
