Hallo Dieter, those are all (!!) very good questions and in fact those are the core things that I'll concentrate my work on in the future. I'm actually even writing my Master thesis about "minimizing bandwidth and disk usage for arbitrary storage types using deduplication and meta-chunking" (working title).
> 1) how do you handle deduplication on the storage layer and the
> networking layer? (like, if user changes 2 random bytes in a 50MB file,
> or renames a file, what kind of network traffic does this cause, and
> what are the implications on storage consumption?)
> [...]
> it seems to me that using backends like ftp are very limiting factors
> because some protocols are really dumb wrt. efficiency.
> do you upload all files in small parts (how small?) to the ftp, as to
> minimize the needed syncing for minimal changes in a big file?

At the current stage, Syncany uses a fixed-size chunking mechanism with a configurable chunk size. In your example, if two bytes were changed and the chunk size was 512 KB, 1 MB would have to be transferred. And it gets worse: if one byte is added at the beginning of the file, all subsequent chunks change and must be retransmitted. This is of course not desirable at all, since it causes significant overhead.

In the future I'm planning to use (a) a sliding-window-based chunking algorithm, e.g. based on Rabin fingerprinting, (b) with very small chunks (8-16 KB). The algorithm I'll try first is based on the "Two Threshold Two Divisor" (TTTD) algorithm [1]. That way, as a result of (a), if a byte is added at the beginning, only one or two chunks change and have to be retransmitted. As a result of (b), those chunks would be significantly smaller than in the current version (in your example maybe 16-32 KB). To counteract the per-connection overhead (one request per chunk), I intend to combine chunks into meta-chunks before uploading them to the storage.

> is it supported to version everything (i.e. keep x (or infinite)
> versions of all files?)

Right now, Syncany has no "cleaning" method to delete old revisions, so it keeps all changes from the first day on. It can assemble every file version from the chunks in the remote repository.
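To illustrate what the sliding-window idea in (a) buys us, here is a heavily simplified sketch in Java. It is not TTTD and not Syncany's actual chunker; the window size, divisor and chunk-size limits are made-up demo values (real chunks would be in the 8-16 KB range):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified content-defined chunker: a rolling sum over the last
// WINDOW bytes decides where chunk boundaries are. Because the cut
// points depend only on the local content (not on absolute offsets),
// inserting a byte at the front only changes the first chunk; the
// boundaries "resynchronize" afterwards.
public class SlidingChunker {
    static final int WINDOW = 48;   // rolling window size (demo value)
    static final int DIVISOR = 64;  // ~64-byte average chunks (demo value)
    static final int MIN = 16;      // minimum chunk size
    static final int MAX = 256;     // maximum chunk size

    static List<byte[]> chunk(byte[] data) {
        List<byte[]> chunks = new ArrayList<>();
        int start = 0, hash = 0;
        for (int i = 0; i < data.length; i++) {
            hash += data[i] & 0xff;                           // byte enters window
            if (i >= WINDOW) hash -= data[i - WINDOW] & 0xff; // byte leaves window
            int len = i - start + 1;
            boolean boundary = (hash % DIVISOR == DIVISOR - 1);
            if ((len >= MIN && boundary) || len >= MAX) {
                chunks.add(Arrays.copyOfRange(data, start, i + 1));
                start = i + 1;
            }
        }
        if (start < data.length)
            chunks.add(Arrays.copyOfRange(data, start, data.length));
        return chunks;
    }
}
```

With a fixed-size chunker, prepending one byte changes every chunk; with this one, all chunks after the first boundary keep their exact content, so their checksums match and they never need to be re-uploaded. TTTD [1] refines this basic scheme with two divisors and two thresholds to keep the chunk-size distribution tight.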
> do you encrypt small blocks of the file? because re-encrypting a file
> that has only changed a little will yield a completely different
> encrypted variant, or not?

I first chunk the file, and then encrypt the resulting chunks. If a small part has changed, I can detect which chunks have changed, then take them, encrypt them and upload them.

> maybe you could use rolling checksums like rsync does (but even that is
> not ideal)

I was using rsync's rolling checksum algorithm for a while, but switched to Adler-32 for some reason. You can look at the current chunker code [2] and the not-yet-working TTTD chunker [3] if you like.

> AFAIK git actually has a pretty efficient blob storage and
> synchronisation system, you could put encrypted blobs in git's storage
> system and get some features for free or camlistore...
> http://camlistore.org/

The problem with Git is that it would limit us to only one protocol. Part of Syncany's goal is to support any storage out there. I briefly looked at camlistore. I suppose we could use it as storage, but since it's in early development as well, I think it's too early for that.

> and what about when you keep, say 2GB in syncany (without modifying any
> file or doing anything special), will it cause an additional 2GB (or
> more) storage overhead, because it also needs to locally store the
> encrypted variant of all files?

Syncany has a local cache that is only used temporarily to download chunks and meta files. Once they are processed, they could be deleted without causing any harm. At the moment, the cache is never cleaned, but that'll come sooner or later :-)

> 2) how do you handle sharing between several users?

At the moment, Syncany assumes that all users who can access a repository AND have the password have full access to all files. That is, if you only have access to the repository, you can delete all files, but not read them. If you only have the password, you could decrypt the files, but you cannot access them. If you have both, you have full access.
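To make the chunk-then-encrypt order concrete, here is a rough sketch of encrypting each chunk independently. This is illustrative only: AES/GCM with a random IV per chunk is my example choice here, not necessarily what Syncany will actually use:

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Each chunk is encrypted on its own, so when a small part of a file
// changes, only the affected chunks have to be re-encrypted and
// re-uploaded; the ciphertext of all other chunks stays untouched.
public class ChunkCrypto {
    static final int IV_LEN = 12;    // 96-bit IV, stored in front of the ciphertext
    static final int TAG_BITS = 128; // GCM authentication tag length

    static byte[] encryptChunk(SecretKey key, byte[] chunk) throws Exception {
        byte[] iv = new byte[IV_LEN];
        new SecureRandom().nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ct = c.doFinal(chunk);
        byte[] out = new byte[IV_LEN + ct.length];
        System.arraycopy(iv, 0, out, 0, IV_LEN);
        System.arraycopy(ct, 0, out, IV_LEN, ct.length);
        return out;
    }

    static byte[] decryptChunk(SecretKey key, byte[] blob) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, blob, 0, IV_LEN));
        return c.doFinal(blob, IV_LEN, blob.length - IV_LEN);
    }
}
```

Note that a random IV means the same chunk encrypts to a different ciphertext each time, so deduplication has to happen on the plaintext chunk checksums before encryption.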
I was thinking about cryptographic access control for a while, but it seems like a lot of effort. Maybe some time in the future.

> how about conflicts? are there means for manual and automatic conflict
> resolving?

Syncany does conflict resolution similar to Dropbox. If two users change the same file at the same time, it detects that and resolves the conflict by renaming the "losing" file to "... (conflicted copy, ..)". The winner is the client that changed the file first (currently based on local time; this will be vector time or Lamport time later).

I hope I could help! If you have any suggestions, please let me know!

Cheers,
Philipp

[1] http://www.hpl.hp.com/techreports/2005/HPL-2005-30R1.pdf
[2] http://bazaar.launchpad.net/~binwiederhier/syncany/trunk/view/head:/syncany/src/org/syncany/index/Chunker.java
[3] http://bazaar.launchpad.net/~binwiederhier/syncany/trunk/view/head:/syncany/src/org/syncany/index/TTTDChunker.java

--
Mailing list: https://launchpad.net/~syncany-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~syncany-team
More help   : https://help.launchpad.net/ListHelp

