Hi,

Quoting Alen Peacock <alenlpeac...@gmail.com>:
> From the data I've seen, collected over many petabytes of customer
> data, 1% is low -- but not by much.
>
> There's another problem with globally convergent storage that we
> haven't talked about yet: hot spots. If there are 100,000 users
> accessing the same .mp3 file regularly, the nodes that store the
> chunks for that file will be overloaded, and if you don't proactively
> alleviate that, service will degrade to unacceptable levels. How can
> you alleviate this problem? By making more copies of the hot data and
> spreading them around the network. How many copies do you need? You
> need a number that scales with the number of users accessing it. If
> you approximate that number by storing a number of copies that is
> proportional to the number of users storing it, you end up with
> something that looks a lot like tossing out global convergent storage
> in the first place.

That's true. This can be a problem, but it depends on the deduplication
degree and the access frequency. In a truly global system, the times at
which people access their data should be spread fairly evenly over the
whole day (because of the different time zones), which reduces the
probability of simultaneous access. Of course, there can still be local
hot spots. But to allow efficient simultaneous access there is no need
to hold additional persistent copies of a file: you could handle
simultaneous accesses in a BitTorrent-like way (which of course poses
new privacy problems).

The problem of service quality and of potentially overloading single
nodes also exists without convergent storage, though. So it probably
makes sense to focus more on this: in a P2P file store you can expect
the individual nodes to have very different characteristics with regard
to upstream bandwidth and the storage they can dedicate to the network.
For service quality, the storage / bandwidth ratio is important. If a
node has very high storage capacity but only small bandwidth, it might
quickly become overloaded by requests and turn into a bottleneck for
the system. We could restrict the storage / bandwidth ratio to a
constant value to obtain equal nodes, but then we waste available
storage and bandwidth. Another idea would be to take access frequencies
into account and push less frequently used data blocks behind slow
links with a higher probability (a rough sketch of such a heuristic
follows further below). But this again increases the implementation
complexity and maintenance cost of the system quite a lot.

> At what % of overlap does global convergent storage become useful? I'd
> claim it would have to be pretty significant. The Microsoft paper
> claiming high convergence did take into account all system files, at a
> time when users didn't have the disk space to store much media in the
> first place. In practice, backing up system files is of very low
> utility -- "bare-metal restores" are not very common, mainly because
> users usually want to do them to a different set of hardware, which
> means you have to have all sorts of insane OS-specific domain
> knowledge to do them properly. In practice, they don't make much sense
> -- so much easier to reinstall OS from media and then restore your
> data files.

Among consumers it may indeed be common to just back up personal files
and reinstall the system from scratch in case of a disk fault. In
business, that's not the case: after a failure you need to restore as
quickly as possible, and there is no time to reinstall the OS from
media. It's also not needed: the hardware should normally be
virtualized, so you can run your backed-up OS on a new set of hardware
as well.
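Coming back to the frequency-aware placement idea from above: a minimal
sketch of what such a heuristic could look like, assuming each node
advertises its free storage and upstream bandwidth. The weighting
formula and the normalisation constant are made up purely for
illustration, not a concrete proposal:

    import random

    class Node:
        def __init__(self, node_id, free_storage, upstream_bw):
            self.node_id = node_id
            self.free_storage = free_storage   # bytes this node still offers
            self.upstream_bw = upstream_bw     # bytes/s it can serve upstream

    def pick_node(nodes, block_size, est_accesses_per_day):
        """Pick a storage node for one block, biased by expected load.

        Hot blocks (high estimated access frequency) prefer nodes with
        high upstream bandwidth; cold blocks may land behind slow links.
        The 100.0 normalisation constant is an arbitrary tuning knob.
        """
        candidates = [n for n in nodes if n.free_storage >= block_size]
        if not candidates:
            raise RuntimeError("no node has enough free storage")

        # Blend bandwidth and free storage: the hotter the block, the more
        # the weight is skewed towards bandwidth instead of storage.
        f = min(1.0, est_accesses_per_day / 100.0)
        weights = [(n.upstream_bw ** f) * (n.free_storage ** (1.0 - f))
                   for n in candidates]
        return random.choices(candidates, weights=weights, k=1)[0]

Whether the extra bookkeeping needed for the access-frequency estimates
pays off is of course exactly the maintenance-cost concern mentioned
above.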
Of course, there is enough backup software available that already
deduplicates data between the different VMs running on one physical
machine, so the additional gain from convergent storage on top may be
limited.

> On Mon, Jan 31, 2011 at 4:15 AM, Michael Militzer <mich...@xvid.org> wrote:
>>
>> BTW: Isn't the "Learn-Partial-Information" attack easy to combat by ensuring
>> that unique files get always encrypted by a true random key? Because if a
>> file is public (so someone else has it too) it contains no secret and is
>> hence not susceptible to this attack. The decision whether to use convergent
>> encryption or not could be made automatically by the system transparent to
>> the user.
>
> But how would you make this determination for the first of two
> duplicate files to be stored? Wouldn't every file appear unique to the
> system the first time it is seen? And even if you could do some sort
> of global pre-imaging to find all non-unique files (which I think
> requires that we assume all content is static), doesn't the global
> mechanism violate privacy/security by being able to do that
> computation in the first place?

Hm yes, I've thought about maintaining a "file tag" that allows a peer
to determine whether its file has already been seen by the system or
not. Only if it has been seen before is convergent encryption used and
the file re-uploaded. The user who first placed the tag (but did not
use convergent encryption, for security reasons) could then later also
reference the convergent copy and reclaim the space used up by his
initial private copy (a rough sketch of this mechanism follows further
below). I guess that's the same idea as what you call "global
pre-imaging". Indeed, this works efficiently only for static data. But
on the other hand: how probable is it that non-static data is actually
public and referenced by other, arbitrary users?

However, I don't really see the privacy / security issue with such a
tag. If the tag is the result of some hash function that's not used for
anything else in the system, the tag should not expose any information
about private files. For non-private files you could obviously look up
whether a certain file is present in the storage system (if you have
the file yourself to calculate the hash). But such a check is also
possible without the tag when convergent encryption is used...

[...]

>> I acknowledge however that the trend might go towards more uniqueness.
>> People take more and more personal pictures (while the overall storage
>> requirement for pictures is comparatively small) and are also starting to
>> record lots of personal video (which takes up lots more storage). However,
>> I'd say that's not yet dominating. It's not the norm to have terabytes
>> of personal video and things like lifelogging are still a vision.
>
> Personal video and photos can dominate, and I'd wager that will be
> even more true as we move forward. Even personal music collections
> often become non-convergent on a file level, because many popular
> media players (including itunes) add information to file headers while
> playing -- star ratings and last played info, for example. This brings
> us to another interesting point: if you really want to do
> de-duplication and squeeze every ounce of savings out of it that you
> can, you need to do it on a finer granularity than file-level.

Indeed. But this brings back the "Learn-Partial-Information" attack
problem. You'd need to separate the static payload (the actual
compressed music data) from the dynamic meta-data (header info like ID3
tags) and store the two independently (see the sketches below).
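To make the "file tag" idea above a bit more concrete, here is a
minimal sketch of how a client could pick between a random key and a
convergent key. The tag_index lookup service, the domain-separation
prefixes and the key size are assumptions for illustration only:

    import hashlib
    import os

    def file_tag(data):
        # The tag hash is domain-separated from every other hash in the
        # system, so it cannot be correlated with convergent keys or
        # block IDs.
        return hashlib.sha256(b"TAG:" + data).digest()

    def convergent_key(data):
        return hashlib.sha256(b"KEY:" + data).digest()

    def choose_encryption(data, tag_index):
        """Return (encryption_key, is_convergent).

        tag_index stands for some global lookup service (e.g. a DHT)
        mapping tags to "seen before"; it is purely hypothetical here.
        """
        tag = file_tag(data)
        if tag_index.contains(tag):
            # Someone else already stored this file, so it is effectively
            # public: convergent encryption is safe and enables dedup.
            return convergent_key(data), True
        # First sighting: record the tag, but encrypt with a true random
        # key so a unique (private) file leaks nothing about its content.
        tag_index.put(tag)
        return os.urandom(32), False

The first uploader keeps his randomly keyed private copy; once the same
tag shows up a second time, both parties can reference the convergent
copy and the private one can be reclaimed, as described above.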
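And as a toy illustration of the static-payload / meta-data split just
mentioned, here is what stripping a leading ID3v2 tag from an MP3 and
deduplicating only the audio payload could look like. Real tag handling
is messier (ID3v1 trailers, tag footers, APE tags and so on are simply
ignored here):

    import hashlib

    def split_id3v2(mp3_bytes):
        """Split an MP3 file into (meta_data, payload).

        Only a leading ID3v2 tag is handled; this is a toy, not a real
        parser.
        """
        if mp3_bytes[:3] == b"ID3" and len(mp3_bytes) >= 10:
            # Bytes 6..9 hold the tag size as a 28-bit "syncsafe" integer
            # (7 significant bits per byte), excluding the 10-byte header.
            size = 0
            for b in mp3_bytes[6:10]:
                size = (size << 7) | (b & 0x7F)
            tag_end = 10 + size
            return mp3_bytes[:tag_end], mp3_bytes[tag_end:]
        return b"", mp3_bytes

    def payload_id(mp3_bytes):
        """Deduplication key derived from the static audio payload only."""
        _meta, payload = split_id3v2(mp3_bytes)
        return hashlib.sha256(payload).digest()

Two copies of the same track that differ only in star ratings or play
counts would then still deduplicate on the payload part, while the
(tiny) meta-data part stays private per user.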
You could add an "intelligent" file system layer on the client side
that does this payload/meta-data separation. But of course that again
makes things a lot more complex.

Thanks for all your input!

Regards,
Michael

_______________________________________________
p2p-hackers mailing list
p2p-hackers@lists.zooko.com
http://lists.zooko.com/mailman/listinfo/p2p-hackers