Alen,

Quoting Alen Peacock <alenlpeac...@gmail.com>:

> On Fri, Jan 28, 2011 at 11:21 AM, Michael Militzer <mich...@xvid.org> wrote:
>
>> This is a problem I thought about a lot. That's because there is
>> information leakage in the system if one party controls a certain number of
>> nodes in the network. Data de-duplication has the advantage that it saves
>> storage and that data already present in the system does not have to be
>> uploaded twice. The disadvantage however is that same data encrypts to the
>> same storage block. This allows a Sybil attacker to upload a number of
>> "interesting" files and log who else in the network accesses the same files.
>> This is easily possible even though all stored and transmitted data is
>> encrypted. So encryption is not enough to ensure privacy.
>>
>> Was this discussed during the design of flud?
>
> I am more convinced than ever that global convergent storage (or
> "single-instance storage," or "deduplication") is of dubious benefit.
> Some of the reasons for that stem from sources and studies that are
> not publicly available, so I'm not at liberty to discuss them ;) .
> But even more compelling than those results regarding the practical
> advantages are the simple attacks outlined by Zooko and the Tahoe team
> here: http://www.mail-archive.com/cryptography@metzdowd.com/msg08949.html

Thanks a lot for the link. I didn't know about this discussion, was unaware
of the second possible attack (Learn-Partial-Information), and didn't know
that the common term for this technique is "convergent encryption". Well,
that's why I'm here...

BTW: Isn't the "Learn-Partial-Information" attack easy to combat by ensuring
that unique files are always encrypted with a truly random key? If a file is
public (so someone else has it too), it contains no secret and is hence not
susceptible to this attack. The decision whether or not to use convergent
encryption could be made automatically by the system, transparently to the
user.
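
Roughly what I have in mind is something like the following (just a sketch
of the idea; is_probably_public() stands for whatever heuristic or lookup
the system would use to guess that a file is shared, it's not anything that
exists today):

import os
import hashlib

def choose_key(plaintext, is_probably_public):
    # Shared/public files get the convergent key (the hash of the plaintext),
    # so identical files still encrypt to identical blocks and can be
    # deduplicated. Unique/private files get a fresh random key, so the
    # Learn-Partial-Information attack learns nothing about them.
    if is_probably_public(plaintext):
        return hashlib.sha256(plaintext).digest()   # convergent key, 32 bytes
    return os.urandom(32)                           # true random key

The price is of course that the system has to guess correctly: a private
file misclassified as public is back to plain convergent encryption.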

Ok, one may argue that if the content is not secret anyway, we don't need to
encrypt it in the first place. But encryption seems to have some merit for
public files too, even if just to make it hard for individual storage nodes
to see what information they are storing (as in Freenet) and to serve as a
simple form of access control.

But yes, the "deduplication" feature causes a lot of problems - both from
a security/privacy point of view and in terms of administrative overhead.
So if it has no benefits and just causes problems, I'd be glad to drop the
idea. But is the benefit really so small - like the 1% mentioned by Zooko?

My understanding so far was that deduplication effects are a major reason
why global, shared storage becomes more economical. There's an economy of
scale: the more participants in the storage network, the more data gets
pooled, the larger the degree of deduplication, and the lower the cost per
stored byte for the individual user.

If this effect hardly exists at all (i.e. only ~1%), that would make a
dramatic difference, so it requires more detailed study and analysis. I
agree that data from businesses is probably dominated by databases that are
unique, so there is little redundancy to remove there. But for private
computers I'd intuitively guess that most of the stored data is not unique.
The OS and lots of software are not unique. Files downloaded from the
internet - not unique. Even your private CD collection ripped to MP3 is
likely not unique, and so on...

I acknowledge, however, that the trend might go towards more uniqueness.
People take more and more personal pictures (although the overall storage
requirement for pictures is comparatively small) and are also starting to
record lots of personal video (which takes up much more storage). Still,
I'd say that's not yet dominating. It's not the norm to have terabytes
of personal video, and things like lifelogging are still a vision.

> There were some rudimentary protections in flud against an entity
> storing "interesting" files and then fishing for other users who also
> stored them. One of these was that storing nodes would not reveal
> identities of nodes storing blocks except to provably owning nodes,
> and even then, only the single identity of that node itself (using the
> self-certifying IDs and challenge/response pairs). This of course is
> insufficient if a storing node is compromised or colludes with the
> originator of the fishing expedition -- another good reason to not do
> global convergent encryption.

Yes, that's the case I thought about. Depending on the size of the
erasure-coded fragments (and the total file size), a file may be dispersed
over a great many storage nodes. So an attacker who compromises or colludes
with just a small number of storage nodes is still likely to hold fragments
of the "interesting" files and can log who requests them. This is a very
realistic threat.
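
(As an aside on the challenge/response you mention: my mental model of the
proof-of-possession part is roughly the following, with the
self-certifying-ID and transport details omitted. That's purely my
assumption about the mechanism, not necessarily what flud does.)

import os
import hashlib

def make_challenge():
    # Storing node: a fresh nonce per request, so old answers can't be replayed.
    return os.urandom(16)

def prove_possession(nonce, block):
    # Requesting node: only someone who actually holds the block contents
    # can compute this digest.
    return hashlib.sha256(nonce + block).digest()

def verify(nonce, block, response):
    # The storing node holds the block too, so it can check the answer.
    return response == hashlib.sha256(nonce + block).digest()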

Do you know of measures to counter this kind of fishing, other than hiding
the identity of the requester via onion routing or something similar?

I thought about a scheme based on multi-server private information
retrieval (PIR). Instead of directly downloading a block belonging to a
certain file, you request linear or polynomial combinations over a set of
blocks belonging to different files from multiple peers. These combinations
let you reconstruct the data you are interested in locally, without exposing
to any single peer which data you actually wanted.

In the most basic scheme this is secure as long as the N peers you
communicate with to reconstruct the block are independent, i.e. none of them
colludes with any of the others. If membership in this N-peer group is
constrained (e.g. the peers must be neighbours in DHT address space), it
becomes rather hard for an attacker to fish for information. The
disadvantage is a replication degree of N and a communication overhead of
N/(N-1).
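
To make that concrete, here is a toy version of the simplest such scheme,
with N = 2 peers and XOR as the linear combination (all blocks assumed to
be the same size; it obviously breaks down as soon as the two peers
collude):

import secrets
from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def build_queries(n, want):
    # Client: pick a random subset of all n block indices for peer 1, and
    # send the same subset with the wanted index flipped to peer 2. Each
    # query on its own is statistically independent of 'want'.
    s1 = {i for i in range(n) if secrets.randbits(1)}
    s2 = s1 ^ {want}
    return s1, s2

def answer_query(blocks, subset):
    # Peer: XOR together the blocks named in the query it received.
    zero = bytes(len(blocks[0]))
    return reduce(xor, (blocks[i] for i in subset), zero)

def reconstruct(a1, a2):
    # Client: everything except the wanted block cancels out.
    return xor(a1, a2)

Here both peers must hold the full block set and the client downloads two
block-sized answers to recover one block, which matches the replication and
overhead figures above for N = 2.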

>> Another problem derived from de-duplication is deletion of data. Data
>> can only be deleted when it is not referenced anymore by any user in the
>> system. This means that even the original uploader may not be allowed to
>> actually delete the file. Something like a ref-counter or delete token
>> is needed. How does flud solve this problem?
>
> There is still likely a lot of benefit from convergent encryption
> within a single entity (think of a consumer backing up two
> machines that each have an entire music collection duplicated, or of a
> small business with 10 users sharing many of the same files).

Yes, I was also thinking mainly of consumer data with respect to
deduplication effects. However, if global convergent encryption has been
observed to provide only a negligible benefit (~1%), it seems unlikely to
be a great feature locally at the same time...
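
Though if one did do it within a single entity, the bookkeeping could stay
entirely local, and the ref-counter / delete-token question from above
becomes trivial. A sketch (my own, not how flud handles it):

import hashlib

class LocalDedupStore:
    # Per-user content-addressed store: identical files are kept only once,
    # and a reference count decides when a block may really be deleted.

    def __init__(self):
        self.blocks = {}   # content hash -> data
        self.refs = {}     # content hash -> number of files referencing it

    def put(self, data):
        h = hashlib.sha256(data).hexdigest()
        if h not in self.blocks:
            self.blocks[h] = data            # first copy: actually store it
        self.refs[h] = self.refs.get(h, 0) + 1
        return h

    def delete(self, h):
        self.refs[h] -= 1
        if self.refs[h] == 0:                # last reference is gone
            del self.blocks[h]
            del self.refs[h]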

Best regards,
Michael



