From the data I've seen, collected over many petabytes of customer data, 1% is low -- but not by much.
There's another problem with globally convergent storage that we haven't talked about yet: hot spots. If there are 100,000 users regularly accessing the same .mp3 file, the nodes that store the chunks for that file will be overloaded, and if you don't proactively alleviate that, service will degrade to unacceptable levels. How can you alleviate this problem? By making more copies of the hot data and spreading them around the network. How many copies do you need? A number that scales with the number of users accessing the data. If you approximate that by storing a number of copies proportional to the number of users who have the file, you end up with something that looks a lot like tossing out global convergent storage in the first place. (A back-of-the-envelope sketch of this arithmetic is further down in this message.)

So getting rid of global convergent storage solves security problems, privacy issues, scaling and performance issues, and architecture issues (reference lists, reference counting, pre-store lookup costs, etc.), all at a cost of not much more total storage (in the pathological case you'd need at least as much storage to deal with scaling under either approach). The complexity of the system is reduced, which means less effort to create it and less opportunity for error.

At what percentage of overlap does global convergent storage become useful? I'd claim it would have to be pretty significant. The Microsoft paper claiming high convergence did take into account all system files, at a time when users didn't have the disk space to store much media in the first place. In practice, backing up system files is of very low utility -- "bare-metal restores" are not very common, mainly because users usually want to do them onto a different set of hardware, which means you need all sorts of insane OS-specific domain knowledge to do them properly. In practice they don't make much sense -- it's so much easier to reinstall the OS from media and then restore your data files.

On Mon, Jan 31, 2011 at 4:15 AM, Michael Militzer <mich...@xvid.org> wrote:
>
> BTW: Isn't the "Learn-Partial-Information" attack easy to combat by ensuring
> that unique files always get encrypted with a true random key? Because if a
> file is public (so someone else has it too) it contains no secret and is
> hence not susceptible to this attack. The decision whether to use convergent
> encryption or not could be made automatically by the system, transparent to
> the user.

But how would you make this determination for the first of two duplicate files to be stored? Wouldn't every file appear unique to the system the first time it is seen? And even if you could do some sort of global pre-imaging to find all non-unique files (which I think requires assuming all content is static), doesn't the global mechanism violate privacy/security by being able to do that computation in the first place?

> My understanding so far was that deduplication effects are a major reason
> why global, shared storage becomes more economic.

There is one undeniable benefit to doing deduplication: market perception. Potential investors and acquirers seem to think it is really something.
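Coming back to the hot-spot point above, here is the back-of-the-envelope arithmetic I mean, as a small Python sketch. The per-node reader capacity and the minimum replica count are numbers I made up purely for illustration:

    import math

    def replicas_needed(active_readers, readers_per_node=50, min_replicas=3):
        # Copies of each chunk needed to keep per-node load tolerable.
        # readers_per_node and min_replicas are illustrative guesses, not
        # measurements from any real system.
        return max(min_replicas, math.ceil(active_readers / readers_per_node))

    # One hot .mp3 with ~100,000 regular listeners:
    print(replicas_needed(100000))   # -> 2000 copies of every chunk

At that point the "one logical copy" story is gone: the storage you spend on the hot data scales with the number of interested users, which is exactly the scaling you'd have had by letting each user store their own copy in the first place.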
> I acknowledge however that the trend might go towards more uniqueness.
> People take more and more personal pictures (while the overall storage
> requirement for pictures is comparatively small) and are also starting to
> record lots of personal video (which takes up lots more storage). However,
> I'd say that's not yet dominating. It's not the norm to have terabytes
> of personal video and things like lifeblogging are still a vision.

Personal video and photos can dominate, and I'd wager that will be even more true as we move forward. Even personal music collections often become non-convergent at the file level, because many popular media players (including iTunes) add information to file headers while playing -- star ratings and last-played info, for example. This brings us to another interesting point: if you really want to do deduplication and squeeze every ounce of savings out of it, you need to do it at a finer granularity than the file level.

> Yes, that's the case I thought about. Depending on the size of the erasure
> coded fragments (and the total file size) a file may be dispersed over
> very many storage nodes. So an attacker needs to compromise or collude
> with just a small number of storage nodes in order to successfully gather
> information. This is a very realistic threat.
>
> Do you know about measures to counter this? (Other than hiding the identity
> of the requester by onion routing or similar).

You can probably guess my suggestion by now ;) : do as Tahoe (and other systems) do and just add an additional secret into the convergent encryption key. (There's a rough sketch of what I mean after my signature.)

> Yes, I was also thinking of consumer data mainly for deduplication effects.
> However, if global convergent encryption has been observed to be just of
> negligible benefit (~1%) it can't frequently make up a great feature
> locally at the same time...

The local numbers (intra-user, intra-entity) are much higher. But even if they weren't, there is one huge benefit to per-entity convergent storage that you might not guess at first. It turns out that one common user behavior is to buy a new machine, reinstall all their software on it (including their backup software of choice), copy all their files over to the new machine, then get rid of the old machine. If the backup software doesn't have some clever mechanism to reassociate the files from the old machine with the new machine, this results in a complete re-upload of all the files. Locally convergent encryption/storage makes this painless. Other methods might work too, but if they depend on the user following instructions and aren't 100% automatic, you can be guaranteed that users won't do it correctly and will end up re-uploading their entire stash.

Alen
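P.S. Since it keeps coming up, here is roughly what I mean by "add an additional secret into the convergent encryption key", as a minimal Python sketch. This is just the shape of the idea -- the function and secret names are mine, and Tahoe's real derivation is more involved (tagged hashes, and it mixes in the encoding parameters as well) -- so don't read it as anyone's actual on-the-wire format:

    import hashlib

    def convergent_key(plaintext, added_secret=b""):
        # Plain convergent encryption: the key depends only on the content,
        # so anyone holding the same file derives the same key (global dedup,
        # but also the confirmation / Learn-Partial-Information exposure).
        content_hash = hashlib.sha256(plaintext).digest()
        # Mixing a per-user or per-entity secret into the derivation keeps
        # dedup within that entity while making the key useless for
        # cross-user probing.
        return hashlib.sha256(added_secret + content_hash).digest()

    data = b"the same file contents on two machines"
    k_global = convergent_key(data)                       # identical for every user
    k_local  = convergent_key(data, b"my entity secret")  # identical only for my own copies

With the secret in place, two users holding the same file derive different keys (no cross-user dedup, but also no cross-user information leak), while one user's -- or one entity's -- copies still converge, which is all you need for the new-machine re-upload case above.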