Re: [freenet-dev] FEC and memory usage

2010-04-14 Thread Matthew Toseland
On Wednesday 14 April 2010 17:52:10 Evan Daniel wrote:
> I've been investigating potential improvements to our FEC encoding in
> my spare time (in particular the use of LDPC codes and their
> relatives, but the following is generally applicable).  I'd like to
> ask for opinions on what assumptions I should be making about
> acceptable levels of CPU time, memory usage, and disk usage.
> 
> We care both about how well our FEC codes work, and how fast they are.
>  How well they work is a surprisingly nuanced question, but for this
> we can assume it's completely described by what block-level loss rate
> the code can recover from, for a specified file size and success rate.
>  As I see it, there are four fundamental metrics: disk space usage,
> disk io (in particular seeks), ram usage, and CPU usage.  Different
> FEC schemes have different characteristics, and allow different
> tradeoffs to be made.
> 
> Our current FEC (simple segments, with Reed-Solomon encoding of each
> segment) does very well on the disk performance.  I haven't examined
> what it actually does, but it could be made to alternately read and
> write sequential 4 MiB blocks, making one pass over the whole file,
> without needing any additional space; this is as good as possible.  It
> does fairly well on memory usage: it needs to hold a whole 4 MiB
> segment in ram at a time, plus a small amount of overhead for lookup
> tables and Vandermonde matrices and such.  CPU performance is poor:
> decoding each block requires operations on the entire segment, and
> those operations are table lookups rather than simple math.  (Decode
> CPU cost is O(n^2) with segment size, and our segments are big enough
> that this is relevant.)

We can deploy the 64-bit native FEC to improve CPU usage by a significant 
factor (IIRC 3x?). And as I understand it, next year's new CPUs will allow us 
to avoid the lookup table.

With regard to disk usage, our current code does not interleave as you 
suggest; each block is stored separately, and new buckets are created for 
newly decoded blocks. Worse, we stripe to avoid memory usage - this is a 
leftover from the 0.5 days and we should get rid of it. Interleaving would of 
course give better performance and disk usage, but it would take significant 
work. A complicating factor is healing, which requires us to encode after 
decoding, although this doesn't prevent interleaving, since check blocks we 
have already fetched can be safely discarded. A further complicating factor 
is binary blobs, but again this can be managed.
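
A rough sketch of what a one-pass, interleaved decode could look like - the 
decoder works on one whole 4 MiB segment in RAM at a time, reading and writing 
the file in place. SegmentDecoder and the bookkeeping of which blocks are 
present are purely illustrative, not Freenet's actual API:

import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative stand-in for whatever RS decoder we end up with; check-block
// handling and tracking of which blocks are present are omitted.
interface SegmentDecoder {
    void decode(byte[] segment, int length); // decode one segment in place
}

class OnePassDecode {
    static final int SEGMENT_SIZE = 4 * 1024 * 1024;

    static void decodeInPlace(RandomAccessFile file, SegmentDecoder decoder)
            throws IOException {
        byte[] buf = new byte[SEGMENT_SIZE];
        long length = file.length();
        for (long offset = 0; offset < length; offset += SEGMENT_SIZE) {
            int n = (int) Math.min(SEGMENT_SIZE, length - offset);
            file.seek(offset);
            file.readFully(buf, 0, n);   // sequential read of one segment
            decoder.decode(buf, n);      // decode entirely in RAM
            file.seek(offset);
            file.write(buf, 0, n);       // write the decoded segment back in place
        }
    }
}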

Bugs need to be filed for all these optimisations - IF our current FEC remains 
an important part of Freenet. But it won't. It will only be used for small 
files, because for large files segmentation makes redundancy performance very 
poor.
> 
> Other schemes will likely make different tradeoffs.  A naive LDPC
> implementation will use very little RAM and CPU, but do lots of disk
> seeks and need space to store (potentially) all data and check blocks
> of the file on disk during that time (that is, double the file size,
> where the current RS codes only need space equal to the final file
> size).  However, it also allows ways to trade off more memory usage
> and CPU time for less disk io and (I think) less disk space usage.  

Currently we use a lot of disk space. This can be greatly optimised, but I'm 
not sure using up to 2x the file size temporarily during decode (most of which 
will have to happen at the end) is a big deal.

What is a big deal is that while LDPC allows for some partial decoding, most 
of the decode will have to be done at the end. In simple segmented RS codes, 
by contrast, segments are decoded independently while the download goes on. 
And presumably interleaved segments, while they will have some cascading, 
will have a significant amount of the decoding happen in parallel with 
downloading - especially for popular files. Decodes which are not on the 
critical path are not a problem IMHO - it's the big, blocking burst of heavy 
disk I/O at the end that is the larger issue.

Specifically, the ongoing decodes involve seeking, but not a lot of seeking - 
an average of 5 blocks, each of which can be read sequentially, one of which 
we already have buffered (at the OS level if not at the Freenet level) because 
we just downloaded it or we are cascading in some small way. Okay, this means 
6 seeks (4 reads plus 2 writes) instead of 1 on downloading a block, and that 
might become a problem with cheap disks on faster network connections - but it 
is interesting to note that the current datastore does 5 seeks on writing a 
block (but this can also be radically optimised).
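
To make the arithmetic explicit, a back-of-the-envelope sketch - the numbers 
are just the assumptions stated above, not measurements:

// Rough seek count for a newly downloaded block that triggers an ongoing
// decode, under the assumptions above. Purely illustrative arithmetic.
class SeekEstimate {
    public static void main(String[] args) {
        int blocksPerDecode = 5;    // average blocks involved in one decode
        int alreadyBuffered = 1;    // the block we just downloaded / cascaded
        int writes = 2;             // as stated above
        int seeksWithDecode = (blocksPerDecode - alreadyBuffered) + writes;
        int seeksWithoutDecode = 1; // just write the downloaded block
        System.out.println(seeksWithDecode + " seeks vs " + seeksWithoutDecode);
        // prints: 6 seeks vs 1
    }
}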

Of course, segmentation solves many of these problems, as well as making 
streaming easier - but segmentation is unacceptable because it dramatically 
reduces success rates. Unless we could have some sort of hybrid scheme where 
most but not all of the blocks we need are within the pseudo-segment...

An interesting side-issue: LDPC decoding can

[freenet-dev] FEC and memory usage

2010-04-14 Thread Evan Daniel
I've been investigating potential improvements to our FEC encoding in
my spare time (in particular the use of LDPC codes and their
relatives, but the following is generally applicable).  I'd like to
ask for opinions on what assumptions I should be making about
acceptable levels of CPU time, memory usage, and disk usage.

We care both about how well our FEC codes work, and how fast they are.
 How well they work is a surprisingly nuanced question, but for this
we can assume it's completely described by what block-level loss rate
the code can recover from, for a specified file size and success rate.
 As I see it, there are four fundamental metrics: disk space usage,
disk io (in particular seeks), ram usage, and CPU usage.  Different
FEC schemes have different characteristics, and allow different
tradeoffs to be made.

Our current FEC (simple segments, with Reed-Solomon encoding of each
segment) does very well on the disk performance.  I haven't examined
what it actually does, but it could be made to alternately read and
write sequential 4 MiB blocks, making one pass over the whole file,
without needing any additional space; this is as good as possible.  It
does fairly well on memory usage: it needs to hold a whole 4 MiB
segment in ram at a time, plus a small amount of overhead for lookup
tables and Vandermonde matrices and such.  CPU performance is poor:
decoding each block requires operations on the entire segment, and
those operations are table lookups rather than simple math.  (Decode
CPU cost is O(n^2) with segment size, and our segments are big enough
that this is relevant.)
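
For anyone wondering why the decode is lookup-bound: multiplication in 
GF(2^8) is normally done via log/exp tables, so every byte of every block 
costs several table accesses. A minimal sketch (the 0x11d field polynomial is 
an assumption - I haven't checked which representation our codec uses):

// Minimal GF(2^8) arithmetic via log/exp tables, i.e. the per-byte table
// lookups that RS encode/decode spends its time on.
class GF256 {
    static final int[] EXP = new int[512];
    static final int[] LOG = new int[256];
    static {
        int x = 1;
        for (int i = 0; i < 255; i++) {
            EXP[i] = x;
            LOG[x] = i;
            x <<= 1;                          // multiply by the generator (2)
            if ((x & 0x100) != 0) x ^= 0x11d; // reduce by the field polynomial
        }
        for (int i = 255; i < 512; i++) EXP[i] = EXP[i - 255];
    }

    // One multiply = two lookups into LOG, an add, and one lookup into EXP.
    static int mul(int a, int b) {
        if (a == 0 || b == 0) return 0;
        return EXP[LOG[a] + LOG[b]];
    }
}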

Other schemes will likely make different tradeoffs.  A naive LDPC
implementation will use very little RAM and CPU, but do lots of disk
seeks and need space to store (potentially) all data and check blocks
of the file on disk during that time (that is, double the file size,
where the current RS codes only need space equal to the final file
size).  However, it also allows ways to trade off more memory usage
and CPU time for less disk io and (I think) less disk space usage.  An
interleaved segments code based on RS codes (like the CIRC code used
on CDs) would be worse than our current scheme (equivalent memory
usage, poor CPU performance, slightly more disk space required, a
moderate number of disk seeks required).  (Both LDPC and interleaved
segments are more effective than our current scheme for large files.)
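
To illustrate why a naive LDPC decoder is cheap on RAM and CPU: erasure 
decoding is just iterative peeling - find a check equation with exactly one 
missing block, XOR the known ones together to recover it, repeat. A toy 
sketch with made-up data structures, not a real implementation:

import java.util.List;

// Each equation lists block indices (data and check) whose XOR is zero; a
// block is "present" once downloaded or recovered. Illustrative only.
class PeelingDecoder {
    static void peel(byte[][] blocks, boolean[] present, List<int[]> equations,
            int blockSize) {
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int[] eq : equations) {
                int missing = -1, missingCount = 0;
                for (int b : eq) {
                    if (!present[b]) { missing = b; missingCount++; }
                }
                if (missingCount != 1) continue; // solvable only with one unknown
                byte[] recovered = new byte[blockSize];
                for (int b : eq) {
                    if (b == missing) continue;
                    for (int i = 0; i < blockSize; i++)
                        recovered[i] ^= blocks[b][i]; // XOR the known blocks
                }
                blocks[missing] = recovered;
                present[missing] = true;
                progress = true; // recovering one block may unlock others
            }
        }
    }
}

Each recovery only touches the few blocks in one equation, and recoveries can 
run as blocks arrive; the cost shows up as disk seeks and as whatever work is 
left until enough blocks are in.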

So, given that the tradeoffs will be complex, and that the decoder is
likely to have some flexibility (e.g. more memory usage for fewer
seeks), what baseline assumptions about these should I be making?  Do
we care more about reducing the number of seeks, even if it has an
increased cost in memory usage or CPU time?  How much memory is it
safe to assume will always be available?  Is it ok to need disk space
beyond the file size?  What if avoiding that has a significant cost in
CPU time?

I realize these are fairly vague questions; vague and opinion-based
answers are certainly welcome.  Hopefully it won't be too long before I
can toss some example numbers into the discussion.

Evan Daniel


