Alen,

Quoting Alen Peacock <alenlpeac...@gmail.com>:
> On Fri, Jan 28, 2011 at 11:21 AM, Michael Militzer <mich...@xvid.org> wrote:
>
>> This is a problem I thought about a lot. That's because there is
>> information leakage in the system if one party controls a certain number
>> of nodes in the network. Data de-duplication has the advantage that it
>> saves storage and that data already present in the system does not have
>> to be uploaded twice. The disadvantage, however, is that the same data
>> encrypts to the same storage block. This allows a sybil attacker to
>> upload a number of "interesting" files and log who else in the network
>> accesses the same files. This is easily possible even though all stored
>> and transmitted data is encrypted. So encryption alone is not enough to
>> ensure privacy.
>>
>> Was this discussed during the design of flud?
>
> I am more convinced than ever that global convergent storage (or
> "single-instance storage," or "deduplication") is of dubious benefit.
> Part of the reasons for that stem from sources and studies that are not
> publicly available, so I'm not at liberty to discuss them ;) . But even
> more compelling than those results regarding the practical advantages
> are the simple attacks outlined by Zooko and the Tahoe team here:
> http://www.mail-archive.com/cryptography@metzdowd.com/msg08949.html

Thanks a lot for the link. I didn't know about this discussion; I was
unaware of the second possible attack (Learn-Partial-Information) and that
the common term for this scheme is "convergent encryption". Well, that's
why I'm here...

BTW: Isn't the "Learn-Partial-Information" attack easy to counter by
ensuring that unique files always get encrypted with a truly random key?
If a file is public (so someone else has it too), it contains no secret
and is hence not susceptible to this attack. The decision whether to use
convergent encryption could be made automatically by the system,
transparently to the user.
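For reference, the determinism that enables both deduplication and the
confirmation attack can be sketched in a few lines. This is a toy
illustration only, not the scheme of any particular system; the
hash-counter keystream merely stands in for a real cipher such as
AES-CTR, and all function names are my own:

```python
import hashlib

def convergent_encrypt(plaintext: bytes) -> tuple[bytes, bytes, bytes]:
    """Toy convergent encryption: the key is the hash of the content,
    so identical plaintexts always yield identical ciphertexts.

    The hash-counter keystream below is a stand-in for a real cipher
    (e.g. AES-CTR); do not use this for actual encryption."""
    key = hashlib.sha256(plaintext).digest()  # content-derived key
    keystream = b""
    counter = 0
    while len(keystream) < len(plaintext):
        keystream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, keystream))
    storage_index = hashlib.sha256(ciphertext).digest()  # dedup lookup key
    return key, storage_index, ciphertext

# Two independent users storing the same file produce the same block:
k1, idx1, c1 = convergent_encrypt(b"a popular public file")
k2, idx2, c2 = convergent_encrypt(b"a popular public file")
assert idx1 == idx2 and c1 == c2  # dedup works -- and so does the
                                  # confirmation attack described above
```

An attacker who holds a candidate plaintext can run the same function and
check whether the resulting storage index already exists in the network,
which is exactly the confirm-a-file attack; a per-file random key breaks
this determinism at the cost of losing deduplication.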
Ok, one may argue that if the content is not secret anyway, we don't need
to encrypt it in the first place. But it seems that encryption has some
merit even for public files, if only to make it hard for individual
storage nodes to see what information they are storing (as in Freenet)
and as a simple means of access control.

But yes, the "deduplication" feature causes a lot of problems - both from
a security/privacy point of view and regarding the administrative
overhead. So if it has no benefits and just causes problems, I'd be glad
to drop the idea. But is the benefit really so small (like the 1%
mentioned by Zooko)? My understanding so far was that deduplication
effects are a major reason why global, shared storage becomes more
economic. There's economy of scale: the more participants in the storage
network, the more data gets pooled, the larger the degree of
deduplication, and the lower the cost per stored byte for the individual
user. If this effect does not exist at all (-> 1%), that would make a
dramatic difference. So this requires more detailed study and analysis.

I agree that data from businesses is probably dominated by databases that
are unique, so there is little redundancy to remove. But for private
computers I'd intuitively guess that most of the stored data is not
unique. There's the OS and lots of software - not unique. Files
downloaded from the internet - not unique. Even your private CD
collection ripped to MP3 is likely not unique, and so on... I
acknowledge, however, that the trend might be towards more uniqueness.
People take more and more personal pictures (while the overall storage
requirement for pictures is comparatively small) and are also starting to
record lots of personal video (which takes up much more storage).
However, I'd say that's not yet dominant. It's not the norm to have
terabytes of personal video, and things like lifelogging are still a
vision.
> There were some rudimentary protections in flud against an entity
> storing "interesting" files and then fishing for other users who also
> stored them. One of these was that storing nodes would not reveal
> identities of nodes storing blocks except to provably owning nodes,
> and even then, only the single identity of that node itself (using the
> self-certifying IDs and challenge/response pairs). This of course is
> insufficient if a storing node is compromised or colludes with the
> originator of the fishing expedition -- another good reason to not do
> global convergent encryption.

Yes, that's the case I thought about. Depending on the size of the
erasure-coded fragments (and the total file size), a file may be
dispersed over very many storage nodes. So an attacker needs to
compromise or collude with just a small number of storage nodes in order
to successfully gather information. This is a very realistic threat. Do
you know of measures to counter this (other than hiding the identity of
the requester by onion routing or similar)?

I thought about a scheme based on multi-server private information
retrieval. Instead of directly downloading a block belonging to a certain
file, you request linear or polynomial combinations over a set of blocks
belonging to different files from multiple peers, so that you can
reconstruct the data you are interested in locally without exposing to
any single peer which data you actually wanted. In the most basic scheme
this is secure as long as all of the N peers you communicate with to
reconstruct the block are independent, i.e. none colludes with any of the
others. If membership in this N-peer group is controlled (e.g. they must
be neighbours in DHT address space), it becomes rather hard for an
attacker to fish for information. The disadvantage is a replication
degree of N and a communication overhead of N/(N-1).
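The basic scheme above can be illustrated with the classic two-server XOR
PIR construction: the client sends a uniformly random index set to one
replica and the same set with the wanted index flipped to the other; each
replica returns the XOR of the named blocks, and XORing the two answers
recovers the block, while each replica alone sees only a random subset.
A rough sketch (the 16-byte block size and all names are my own
assumptions, not from any real system):

```python
import secrets

BLOCK_SIZE = 16  # bytes per block (assumption for this sketch)

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def server_answer(db: list[bytes], query: set[int]) -> bytes:
    """Server side: XOR together every block named in the query.
    The query on its own is a uniformly random subset, so the server
    learns nothing about which single block the client wants."""
    acc = bytes(BLOCK_SIZE)
    for i in query:
        acc = xor_bytes(acc, db[i])
    return acc

def client_retrieve(i: int, n: int, ask_a, ask_b) -> bytes:
    """Client side: two queries differing only in index i."""
    s = {j for j in range(n) if secrets.randbits(1)}  # random subset
    t = s ^ {i}  # symmetric difference: flip membership of i
    # XOR of both answers cancels every block except block i.
    return xor_bytes(ask_a(s), ask_b(t))

# Demo: a toy 8-block database replicated on two non-colluding servers.
db = [bytes([j]) * BLOCK_SIZE for j in range(8)]
got = client_retrieve(5, len(db),
                      lambda q: server_answer(db, q),
                      lambda q: server_answer(db, q))
assert got == db[5]
```

This matches the cost noted above: with N = 2 replicas, the database is
stored twice, and each retrieval of one block transfers one block-sized
answer from each server.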
>> Another problem derived from de-duplication is deletion of data. Data
>> can only be deleted when it is not referenced anymore by any user in
>> the system. This means that even the original uploader may not be
>> allowed to actually delete the file. Something like a ref-counter or
>> delete token is needed. How does flud solve this problem?
>
> There is still likely a lot of benefit from convergent encryption
> within a single entity (think of a consumer backing up two machines
> that each have an entire music collection duplicated, or of a small
> business with 10 users sharing many of the same files).

Yes, I was also thinking mainly of consumer data for deduplication
effects. However, if global convergent encryption has been observed to be
of only negligible benefit (~1%), it can't frequently amount to a great
feature locally at the same time...

Best regards,
Michael

_______________________________________________
p2p-hackers mailing list
p2p-hackers@lists.zooko.com
http://lists.zooko.com/mailman/listinfo/p2p-hackers