From the data I've seen, collected over many petabytes of customer
data, 1% is low -- but not by much.

There's another problem with globally convergent storage that we
haven't talked about yet: hot spots. If there are 100,000 users
accessing the same .mp3 file regularly, the nodes that store the
chunks for that file will be overloaded, and if you don't proactively
alleviate that, service will degrade to unacceptable levels. How can
you alleviate this problem? By making more copies of the hot data and
spreading them around the network. How many copies do you need? A
number that scales with the number of users accessing the file. If
you approximate that by keeping a number of copies proportional to
the number of users who store the file, you end up with something
that looks a lot like having tossed out global convergent storage in
the first place -- roughly one copy per user.
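
To put rough numbers on that, here's a toy calculation (the per-node
read capacity and per-reader request rate below are invented purely
for illustration, not measured from any real system):

import math

def replicas_needed(active_readers, reads_per_node_per_sec=200,
                    reads_per_reader_per_sec=0.5, minimum=3):
    """How many copies of a hot chunk you need so that no single node
    serving it is pushed past its read capacity."""
    offered_load = active_readers * reads_per_reader_per_sec
    return max(minimum, math.ceil(offered_load / reads_per_node_per_sec))

print(replicas_needed(100_000))   # 250 copies for 100,000 active readers
print(replicas_needed(50))        # 3 -- the durability floor

Whatever numbers you plug in, the copy count grows linearly with the
readers, so you're maintaining a pile of extra copies of every popular
file anyway.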

So, getting rid of global convergent storage solves security problems,
privacy issues, scaling and performance issues, and architecture
issues (reference lists, reference counting, pre-store lookup costs,
etc.), all at a cost of very little additional total storage (in the
pathological case, you'd need at least as much storage to handle
scaling with either approach). The complexity of the system is
reduced, which means less effort to create it and less opportunity for
error.

At what % of overlap does global convergent storage become useful? I'd
claim it would have to be pretty significant. The Microsoft paper
claiming high convergence did take into account all system files, at a
time when users didn't have the disk space to store much media in the
first place. In practice, backing up system files is of very low
utility -- "bare-metal restores" are not very common, mainly because
users usually want to do them onto a different set of hardware, which
means you need all sorts of insane OS-specific domain knowledge to do
them properly. They rarely make sense in practice -- it's much easier
to reinstall the OS from media and then restore your data files.


On Mon, Jan 31, 2011 at 4:15 AM, Michael Militzer <mich...@xvid.org> wrote:
>
> BTW: Isn't the "Learn-Partial-Information" attack easy to combat by ensuring
> that unique files are always encrypted with a truly random key? Because if a
> file is public (so someone else has it too), it contains no secret and is
> hence not susceptible to this attack. The decision whether to use convergent
> encryption or not could be made automatically by the system, transparently
> to the user.

But how would you make this determination for the first of two
duplicate files to be stored? Wouldn't every file appear unique to the
system the first time it is seen? And even if you could do some sort
of global pre-imaging to find all non-unique files (which I think
requires that we assume all content is static), doesn't the global
mechanism violate privacy/security by being able to do that
computation in the first place?


> My understanding so far was that deduplication effects are a major reason
> why global, shared storage becomes more economic.

There is one undeniable benefit to doing deduplication: market
perception. Potential investors and acquirers seem to think it is
really something.


> I acknowledge however that the trend might go towards more uniqueness.
> People take more and more personal pictures (while the overall storage
> requirement for pictures is comparatively small) and are also starting to
> record lots of personal video (which takes up lots more storage). However,
> I'd say that's not yet dominating. It's not the norm to have terabytes
> of personal video and things like lifeblogging are still a vision.

Personal video and photos can dominate, and I'd wager that will be
even more true as we move forward. Even personal music collections
often become non-convergent at the file level, because many popular
media players (including iTunes) add information to file headers while
playing -- star ratings and last-played info, for example. This brings
us to another interesting point: if you really want to do
deduplication and squeeze every ounce of savings out of it that you
can, you need to do it at a finer granularity than the file level.
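
A toy illustration of why (it doesn't parse real ID3 tags -- it just
flips one header byte -- and the chunking is naive fixed-size blocks
rather than anything clever):

import hashlib, os

CHUNK = 64 * 1024

def chunk_hashes(data):
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

original = os.urandom(1024 * 1024)                # stand-in for an .mp3
retagged = bytes([original[0] ^ 1]) + original[1:]  # "player bumped the rating"

print(hashlib.sha256(original).digest() ==
      hashlib.sha256(retagged).digest())          # False: file-level dedup gets nothing

a, b = chunk_hashes(original), chunk_hashes(retagged)
print(sum(x == y for x, y in zip(a, b)), "of", len(a), "chunks unchanged")
# -> 15 of 16 chunks still deduplicate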


> Yes, that's the case I thought about. Depending on the size of the erasure
> coded fragments (and the total file size) a file may be dispersed over
> very many storage nodes. So an attacker needs to compromise or collude
> with just a small number of storage nodes in order to successfully gather
> information. This is a very realistic threat.
>
> Do you know about measures to counter this? (Other than hiding the identity
> of the requester by onion routing or similar).

You can probably guess my suggestion by now ;) : do as Tahoe (and
other systems) do and just add an additional secret into the
convergent encryption key.
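
For concreteness, here's a rough sketch of the shape of that (this is
not Tahoe-LAFS's actual key-derivation code, just the general idea):
mix a per-entity secret into the hash of the plaintext, so the key is
still deterministic for anyone holding the secret, but an outside
observer can no longer confirm a guess about the content.

import hashlib, hmac, os

def convergent_key(plaintext, convergence_secret):
    """Same plaintext + same secret -> same key (so dedup still works
    within the entity), but without the secret the key is unguessable."""
    file_hash = hashlib.sha256(plaintext).digest()
    return hmac.new(convergence_secret, file_hash, hashlib.sha256).digest()

secret_a = os.urandom(32)     # entity A's convergence secret
secret_b = os.urandom(32)     # entity B's convergence secret
data = b"the bytes of some widely shared file"

print(convergent_key(data, secret_a) == convergent_key(data, secret_a))  # True: dedup within A
print(convergent_key(data, secret_a) == convergent_key(data, secret_b))  # False: no cross-entity oracle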


> Yes, I was also thinking of consumer data mainly for deduplication effects.
> However, if global convergent encryption has been observed to be just of
> negligible benefit (~1%) it can't frequently make up a great feature
> locally at the same time...

The local numbers (intra-user, intra-entity) are much higher. But even
if they weren't, there is one huge benefit to per-entity convergent
storage that you might not guess at first. It turns out that one
common user behavior is to buy a new machine, reinstall all their
software on it (including their backup software of choice), copy all
their files over to the new machine, then get rid of the old machine.
If the backup software doesn't have some clever mechanism to
reassociate the files from the old machine to the new machine, this
can result in a complete re-upload of all the files. Locally
convergent encryption/storage makes this painless. Other methods might
work too, but if they depend on the user following instructions and
aren't 100% automatic, you can be guaranteed that users won't do it
correctly and will end up re-uploading their entire stash.
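
A hypothetical sketch of why that falls out for free, assuming storage
indexes are derived the same way as in the sketch above (the function
names and the server interface here are made up): the new machine
recomputes the same index for every file it copied over, asks the
server whether it already has it, and only ships the bytes it doesn't.

import hashlib, hmac

def storage_index(plaintext, convergence_secret):
    # same derivation as above: deterministic per entity
    h = hashlib.sha256(plaintext).digest()
    return hmac.new(convergence_secret, h, hashlib.sha256).hexdigest()

def backup(files, convergence_secret, server_has, upload):
    """files: iterable of (name, bytes); server_has/upload: callables."""
    for name, data in files:
        idx = storage_index(data, convergence_secret)
        if server_has(idx):
            continue           # the old machine already stored this -- skip
        upload(idx, data)      # genuinely new data (encrypt before upload)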

Alen
_______________________________________________
p2p-hackers mailing list
p2p-hackers@lists.zooko.com
http://lists.zooko.com/mailman/listinfo/p2p-hackers
