Hi Zooko,

thanks a lot for your thorough response. I think we don't really disagree
much at all, and the misunderstanding we initially had ended up triggering
a really interesting discussion.

So let me further comment on some of your thoughts in-line below:

Quoting Zooko O'Whielacronx <zo...@zooko.com>:

> On Wed, Feb 23, 2011 at 10:37 AM, Michael Militzer <mich...@xvid.org> wrote:
>>
>>> Replication costs more bandwidth on upload than erasure coding does
>>> (for a similar degree of fault-tolerance) as well as costing more
>>> storage.
>>
>> True. But the repair cost is minimal compared to replacing a lost erasure
>> coded fragment. So assuming long-lived data stored in a network of dynamic,
>> unreliable peers (P2P), repair may be frequently needed and cost much more
>> bandwidth over the whole life-time of a data object than the initial upload.
>> In such an environment replication can hence be more bandwidth efficient
>> than erasure coding - while of course requiring more storage in any case.
>
> I agree that we should think about repair costs -- thank you for the
> reminder! I think we can reduce repair costs to as low or almost as
> low as replicating schemes.
>
> The cost of adding X bytes of replicated data to your grid (for
> example, in response to a server failure which destroyed X bytes) is
> about X bytes of bandwidth. The cost of doing so for erasure-coded
> data is about K*X bytes of bandwidth.

Yes, that's what I was referring to and what is also widely cited in the
literature. But as you correctly point out, there are ways to improve the
repair cost of erasure codes. In particular, I think repair should not be
modeled independently of availability: the actual repair costs depend
greatly on whether we can afford a lazy repair scheme or not.

Let's work through a small example:

Assume our node availability is 60% and that we want a stored file to be
available with a probability between three nines and six nines.

With erasure coding we get: k=16, n=57 (six nines) and n=45 (three nines)
With replication it is: k=1, n=15 (six nines) and n=8 (three nines)

Let's say we perform lazy repair and schedule a repair only when the
availability of the file has dropped to three nines.

In the case of erasure coding, we then need to replace 12 fragments. The
bandwidth cost is (16 + 12)*fragment_size = 1.75 * filesize.

For replication, we need to recreate 7 replicas, so the repair cost is
7.0 * filesize, which is much higher than for erasure coding.

So: the higher the degree of redundancy introduced for storage, the more
data will have to be repaired over the lifetime of an object. With lazy
repair, erasure coding can therefore have a substantial advantage.
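
For reference, here is a small Python sketch that reproduces the numbers
above. It assumes independent node failures and a plain binomial
availability model, which is of course itself a simplification:

from math import comb

def availability(k, n, p):
    """Probability that at least k of n fragments sit on available nodes,
    with per-node availability p (independent nodes assumed)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 0.6  # node availability from the example

# Erasure coding: k = 16, n = 57 (target) down to n = 45 (repair threshold)
print(availability(16, 57, p))   # roughly six nines
print(availability(16, 45, p))   # roughly three nines

# Replication: k = 1, n = 15 (target) down to n = 8 (repair threshold)
print(availability(1, 15, p))    # roughly six nines
print(availability(1, 8, p))     # roughly three nines

# Lazy-repair bandwidth per repair cycle, relative to the file size:
# erasure coding reads k fragments and writes back the missing ones,
# replication copies whole replicas.
print((16 + (57 - 45)) / 16)     # 1.75 * filesize
print((15 - 8) / 1)              # 7.0  * filesize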

> For one thing, you can choose to get the repair data from the cheapest
> peers. Not all bandwidth is equally costly -- sometimes it varies
> tremendously. A good example of this is if you have a several physical
> locations each with a number of storage servers in it. Bandwidth
> within the co-lo is basically free and bandwidth between co-los is
> costly, so if one server fails and has to be repaired, if you can draw
> all of the repair data from peers in the same location then this might
> turn out to be cheaper than a replicating scheme. (For the replicating
> scheme to have another replica available in the same co-lo means that
> it has to have at least 2 replicas in each co-lo, while an
> erasure-coding scheme can ensure repair is possible within-co-lo at
> the cost of storing only 1 + 1/K of the data in each co-lo. Also if it
> has most but not all of the needed shares in-co-lo then it can pull
> only a few shares from out of co-lo.)

Fully agree. But in a pure P2P storage system (i.e. "BitTorrent for storage")
this is less of an issue because the peers don't pay for the transit traffic
directly; their ISPs do. Of course, designing the system to keep repair
traffic local would be good in the P2P use case as well and would make the
ISPs happy.

> That same principle could apply to any storage network in which it is
> cheaper to transfer data from some peers than from others.
>
> Next, you can add the existence of a replica of each individual share.
> For example, if you have 3-out-of-6 encoding, then you keep two
> replicas of each of the six shares, then you have an overall
> redundancy of 4X. You can compare that to a replication-based scheme
> in which you spend the same amount of storage on four replicas of the
> data. With the erasure-coding you get better fault-tolerance and the
> repair costs are exactly as cheap as with replication, as long as you
> can find at least one replica of each damaged share.

You can do better than simply replicating the erasure coded fragments.
Have a look at the Twin-MDS framework that is based on the principle of
using vectors as source symbols:

http://www.ece.iisc.ernet.in/~vijay/storage/papers/TwinCodeFramework.pdf

Basically, it is quite close to your idea but has the advantage that any
lost fragment of Code 1 can be directly repaired as long as k out of n
fragments of Code 2 are still available (while simple replication requires
that exactly the replica of the lost fragment is available).

The authors claim that Twin-MDS is both storage and bandwidth optimal,
which, however, is not really true. For the same availability, encoding an
object into two independent erasure codes obviously requires more storage
than using just a single encoding. Also, because of what I outlined above
(lazy repair), the repair bandwidth of Twin-MDS is not necessarily minimal.

Still, Twin-MDS codes are a very interesting concept and can be useful.

> Next, you can add a replica of the full data. If a single computer has
> a replicate of the original data (or, equivalently, if it has any K of
> the shares), then it can generate any share that it needs to in order
> to send a share to another computer to replace a missing share. This
> doesn't completely blow the storage efficiency advantage of
> erasure-coding (since you want *more* than one replica in your
> replication scheme) but it does let repair be just as cheap as in a
> replication scheme. Also notice that this "comes for free" in use
> cases where the user is keeping a copy of the full data anyway. For
> example, if you are using the distributed storage system for hosting
> your web site, and you have a copy of the web site on your laptop, and
> you periodically do a repair operation from your laptop. In that case,
> the cost of repair is exactly the cost of bandwidth from your laptop
> to the servers times the amount of missing data.

Yes, that's another possible combination of erasure coding and replication.
Indeed, if the owner keeps a local copy of the data, an owner-initiated
repair is cheapest. For all other cases, Twin-MDS is the best mixture of
erasure coding and replication I know of. That's also because it allows a
multi-source repair strategy (k storage nodes can work together to repair
one lost fragment), so it offers better load balancing and fault tolerance
than keeping a copy of all k shares at one node.

> Finally, the advantages of erasure-coding in storage efficiency also
> mean that you have fewer things that can break. If a replication
> scheme uses three-way replication but an erasure coding scheme uses
> 3-out-of-6 encoding, for example, then the erasure coding scheme
> requires only 2/3 as many servers as the replication scheme does,
> which means 2/3 as many failures to recover from. (And it has
> similarly good fault-tolerance.)

Yes. Your observation perfectly complements the example I gave above: a
higher degree of redundancy means more data can break and needs to be
repaired.

>> Correct. But you can design the system in such a way that the data will
>> still be available as long as only a certain percentage of the nodes are
>> "honest". Then the key to achieve a reliable storage system reduces to
>> making it practically impossible that any single malicious entity can
>> gain control over the necessary set of network nodes. And here, size
>> becomes your friend: The larger the network and the wider the stored data
>> is dispersed, the harder for a single attacker to reach the necessary
>> percentage to break availability. That's mathematical magic too ;)
>
> I am skeptical about this. First of all: what's your defense against
> the straight-up Sybil attack?

I agree that Sybils cannot be eliminated from the network. So the goal
must be to make it impractical for an attacker to gain control over a
significant fraction of the total network.

And for this, I think you need mainly two properties: a) NodeIDs must be
guaranteed to be random, and b) admission of new nodes to the system must
have a cost.

The admission cost should not only be computational; some manual, human
interaction should be required to prevent an attacker from easily and
automatically creating Sybils. This could be achieved by requiring a
captcha to be solved or by binding nodes to a mobile phone number (so
admission could involve sending an SMS message).

Obviously, such schemes can only be realized with a central entity for
admission control. However, if the central agent is needed just once for
initial admission and has no control over the storage nodes, I think it is
neither a single point of failure nor a threat to the integrity of the
system.

In this regard, refer also to Luca Aiello's Likir DHT:

http://likir.di.unito.it/papers/p2p08.pdf
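
Just to make a) and b) a bit more concrete, here is a rough sketch of what
a one-time, service-certified NodeID could look like (loosely in the spirit
of the Likir approach; all names, the Ed25519 choice and the use of the
Python "cryptography" package are only illustrative assumptions, and the
actual human check is just a placeholder):

import os, hashlib
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# --- admission service (contacted once per joining node) ---
service_key = Ed25519PrivateKey.generate()

def admit(node_pubkey: bytes):
    # Here the captcha / SMS check would happen (placeholder only).
    nonce = os.urandom(32)                        # service-chosen randomness
    node_id = hashlib.sha256(nonce + node_pubkey).digest()
    signature = service_key.sign(nonce + node_pubkey)
    return node_id, nonce, signature

# --- joining node ---
node_key = Ed25519PrivateKey.generate()
node_pub = node_key.public_key().public_bytes(
    serialization.Encoding.Raw, serialization.PublicFormat.Raw)
node_id, nonce, signature = admit(node_pub)

# --- any peer can later check that the NodeID was issued by the service
#     and derived from randomness the node did not choose itself ---
service_key.public_key().verify(signature, nonce + node_pub)  # raises if forged
assert node_id == hashlib.sha256(nonce + node_pub).digest()
print("NodeID:", node_id.hex())

The point is simply that the NodeID is derived from randomness the joining
node cannot influence, and that the admission service is not needed again
afterwards.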

So I think it is possible to mitigate the adverse effects of Sybil attacks
quite a lot, at least in theory. I'm more concerned that very strict
admission control will raise the hurdle to join the network too high for
the average (honest) user as well. To gain high adoption among users, a
"plug & play" experience is needed. If joining the network is too complex,
the system may have high attack resistance on paper but be of no use in
practice because the network fails to gain popularity. So a good middle
ground is needed here.

And I believe there is also no need to eliminate the possibility of Sybils
entirely. It is sufficient to make creating them a time-consuming and
costly effort. A P2P file storage system is not static; if it has some
popularity, it should show a certain growth rate. This means an attacker
needs to consistently outpace the growth of honest nodes in the system.
And I'm quite positive that techniques like the ones mentioned above can
make admission costly enough to render this impractical in the long run.

> Second, I no longer believe that we can
> model the behavior of this large set of servers well enough to use
> them effectively for long-term reliability. Basically, the
> mathematical magic doesn't work for this! I am greatly influenced by
> Nassim Nicholas Taleb's idea of the "ludic fallacy":
>
> http://www.fooledbyrandomness.com/LudicFallacy.pdf

I know Taleb, but I don't quite like his attitude. The problem is not that
statisticians are idiots or that the math doesn't work. Everybody knows
that stock market movements cannot be accurately modeled by a random walk
and that stock returns are not Gaussian-distributed. The presence of fat
tails or "black swan" events and the inability of simple models to cover
them is neither new nor surprising.

So the conclusion should not be to abandon systems based on statistical
models because "they don't work". Models are, by design, an approximate and
simplified view of reality. It is not wrong to make assumptions as long as
we are prepared for reality to be different and for models to fail.

In that regard, I'd like to point out that data persistence is not only
threatened in the "BitTorrent for storage" case but that there exist black
swans for cloud storage too:

http://www.networkworld.com/news/2008/081108-linkup-failure.html?page=1

And interestingly, I believe the black-swan risks associated with P2P
storage and cloud storage are not the same. Therefore, diversification can
be our friend in mitigating the impact of black swans (see below).

> Also by the fact that for many years before I read that I had
> struggled to implement exactly this sort of system (in Mojo Nation,
> Mnet, and Allmydata) and failed. Of course, my failure doesn't prove
> that it is impossible, but it did turn out to be harder than I
> thought.

Well, I very much liked the vision behind Mojo Nation. Maybe it was simply
too far ahead of its time. Meanwhile, we have seen 10 years of further
research in the P2P area, and many problems that were unsolved at the time
of Mojo Nation now have solutions. I agree, though, that complexity is
still a major hurdle.

> I know that this company is trying a similar technique:
> http://memoryboxbackup.com . They have centralized control over all
> client and server behavior such as placement of shares, which should
> make it easier. I don't know how well they are doing. Likewise with
> Wuala. (It can be hard to learn from closed-source startups.)

For Wuala, there are some papers available on the algorithms used, because
Wuala is a spin-off from research conducted at the University of Zurich,
Switzerland (if I remember correctly). I don't remember having read about
any particular methods to increase attack resilience (in the sense we are
discussing it here) for Wuala. I guess they rely on the obscurity of their
protocol for attack prevention. And for the time being (that is, as long
as the service doesn't become too popular) this should work out OK in
practice.

> On the other hand, even if we don't know how to use a bunch of random
> strangers for reliable long-term storage, we may well be able to use
> them for *short-term* purposes. For example, if we have a file which
> we have safely stored somewhere else but we wish to upload it to this
> P2P storage grid for file-sharing and hosting purposes. If the file
> accidentally or maliciously disappears from that grid, we'll
> eventually notice and re-upload it. That might be useful.
>
> So maybe "BitTorrent for storage" is a good idea after all, for
> file-sharing and hosting instead of for long-term storage of your only
> copy of something precious. (Thanks to Rob Bryan for encouraging me to
> think along these lines the other night at the Boulder Hackerspace.)

Actually, my idea is that P2P storage ("BitTorrent for storage") and
well-defined storage clusters ("Tahoe", "cloud storage") are not mutually
exclusive. Both have their own advantages and drawbacks. The risks of cloud
computing have already been discussed. And P2P storage has the risk that it
provides only probabilistic availability, based on assumptions that are
admittedly often blurry. As you rightly point out, there are no hard
guarantees in a P2P system. If the assumptions made at design time turn
out to be very wrong in reality, data could get lost.

So what I have in mind is a combination of both techniques: if the
best-effort service level of the P2P system is enough (and you pointed out
examples where this can be the case), then users can enjoy the benefits of
the system without incurring any additional cost (other than sharing excess
resources). If, however, higher guarantees on data persistence are desired
than what the P2P system can provide, dispersing data over both P2P nodes
_and_ cloud storage nodes can be a solution. E.g. if we erasure code data
with 4x redundancy, simply spread 3*k fragments to P2P nodes and k
fragments to x cloud storage providers of your liking (with x in [1...k]).

Because of this diversification, the probability of data persistence in
such a system is higher than for either storage technique (P2P vs. cloud
storage cluster) alone.
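
To illustrate the effect, here is a rough sketch that computes the
availability of k-of-n decoding when the fragments sit on nodes with
different availabilities (P2P peers vs. cloud providers). The per-node
availability figures are made-up placeholders; the calculation itself is
just the standard Poisson-binomial recursion and assumes independent
failures:

def k_of_n_availability(k, node_availabilities):
    # P(at least k of the listed nodes are up), assuming independence.
    # dp[j] = probability that exactly j of the nodes seen so far are up.
    dp = [1.0]
    for p in node_availabilities:
        new = [0.0] * (len(dp) + 1)
        for j, prob in enumerate(dp):
            new[j]     += prob * (1 - p)   # this node is down
            new[j + 1] += prob * p         # this node is up
        dp = new
    return sum(dp[k:])

k = 16
p2p_nodes   = [0.6] * (3 * k)    # 3*k fragments on P2P peers (~60% up, placeholder)
cloud_nodes = [0.999] * k        # k fragments on cloud providers (placeholder)

print(k_of_n_availability(k, p2p_nodes))                # P2P fragments alone
print(k_of_n_availability(k, cloud_nodes))              # cloud fragments alone
print(k_of_n_availability(k, p2p_nodes + cloud_nodes))  # combined placement

Of course, the independence assumption is exactly what a black swan breaks;
the point of mixing the two is that a failure event taking out the cloud
providers is unlikely to be correlated with one taking out a large share of
the P2P peers, and vice versa.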

[...]

>> did some actual tests with Alexandre Soro's Fermat-prime based   
>> reed-solomon implementation (which should be more efficient than   
>> zfec for k, n >= 256 - at least in theory): For k = 256 and n =   
>> 1024, I get a decoding speed of about 7.5 Mbit/s on my two years   
>> old notebook.
>
> Although, as discussed, I don't really see the use of large values of
> k and n like that, I would be curious about the results of comparing
> that implementation to zfec. Could you please do something like this:

Indeed, I think k=256 is not really needed. Maybe I got carried away a bit
by trying to get small erasure-coded fragment sizes and also to optimize
the storage overhead for my targeted level of availability.

However, I think k should also not be smaller than 16. To give a number:
k = ~16 is about 30% more storage-efficient than k = ~8 when storage nodes
have low availability (e.g. ~50%).
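
For what it's worth, that figure can be reproduced with the same binomial
model as in the example earlier in this mail. The 50% node availability is
the number quoted above; the six-nines target is my own choice here, and
the exact ratio depends on it:

from math import comb

def unavailability(k, n, p):
    # P(fewer than k of the n fragments are reachable), independence assumed.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

def min_expansion(k, p, nines):
    # Smallest n/k such that k-of-n reaches the requested number of nines.
    n = k
    while unavailability(k, n, p) > 10.0**(-nines):
        n += 1
    return n / k

e8  = min_expansion(8,  0.5, 6)
e16 = min_expansion(16, 0.5, 6)
print(e8, e16, e8 / e16)   # expansion factors; the ratio comes out around 1.3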

> wget   
> http://pypi.python.org/packages/source/z/zfec/zfec-1.4.22.tar.gz#md5=105745eb9d3db8f909786a0b39153a79
> tar xzf zfec*.tar.gz
> cd zfec*
> python setup.py build_ext -i
> PYTHONPATH= python bench/bench_zfec.py --k=64 --m=256
>
> (That's the largest k and m with ratio 1/4 that zfec implements.)
>
> On my Macbook Pro (5,3) the worst run out of 64 runs was 4.5 MByte/sec
> (so approximately 36 MBit/sec).

I get ~3.1 MByte/sec with zfec for the worst run on my old notebook.
However, according to the output of "bench_zfec", it measures the
_encoding_ speed, while the numbers I mentioned for the Fermat-prime
Reed-Solomon code were for decoding. And I think we may be more interested
in decoding speed, since decoding can be considerably slower than encoding,
especially for larger n/k.

http://personnel.isae.fr/jerome-lacan/article/fnt-based-reed-solomon-codes.html?lang=fr

Using similar parameters for the FNT coder, I get:

FNT-based MDS code
k = 64, n = 256, symbol size = 16000 bytes
3. Quadratic + log. encoder (low rates)
2. Quadratic direct decoder
Encoding speed : 64.82 Mbits/s
Decoding speed : 25.04 Mbits/s
Encoding/Decoding errors : 0

So that's ~8 MByte/s encoding speed but just ~3 MByte/s for decoding. And
considering that 16 Mbit/s is a typical household downstream bandwidth
today, and that there are computers even slower than my notebook, the
decoder speed is not very satisfying -- even for k=64.

[...]

> Thank you very much for the discussion!
>
> (By the way, I think I've worked on this letter every day for more
> than a week. I hope it is helpful!)

I really appreciate you taking the time. Thanks a lot!

Regards,
Michael



_______________________________________________
p2p-hackers mailing list
p2p-hackers@lists.zooko.com
http://lists.zooko.com/mailman/listinfo/p2p-hackers
