On Thursday 23 Jan 2003 5:18 pm, Matthew Toseland wrote:

> > Are files never separated into segments, unless FEC is used? What are the
> > minimum and maximum sizes for FEC segments?
>
> It is possible to insert non-redundant splitfiles. They are unreliable
> and slow. FEC splitfiles use "chunks" (segments are something else :)).

OK, terminology noted. :-)

> Fproxy uses 256kB to 1MB chunks, but other clients could use other
> sizes. That is however the recommended range for most uses.

...

> Splitfiles therefore can fail if too many of the chunks are no longer
> fetchable.

Is it possible to use smaller chunks? Can you give me a link to a document 
that explains how to control the use of FEC via fproxy? For example, can I 
force the use of FEC for files smaller than 1 MB?
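Just to check that I understand why non-redundant splitfiles are unreliable, here 
is the back-of-the-envelope arithmetic I have in mind (the per-chunk availability 
p is a number I made up, and I am assuming each chunk succeeds or fails 
independently, which may well be too simple):

from math import comb

def plain_splitfile(p, n):
    # Non-redundant splitfile: all n chunks must still be fetchable.
    return p ** n

def fec_segment(p, n, k):
    # FEC segment: any k of the n inserted blocks (data + check) will do.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Assumed numbers: 128 chunks, 90% chance of retrieving each one.
print(plain_splitfile(0.9, 128))        # ~1e-6: the plain splitfile is nearly always lost
print(fec_segment(0.9, 128 + 64, 128))  # ~1.0:  the FEC version nearly always survives

If that is roughly right, it explains the "unreliable and slow" comment above.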

> > > However if you need to store
> > > chunks of more than a meg, you need to use redundant (FEC) splitfiles.
> >
> > As I said, I was looking for the limits on the small size, rather than
> > the large size. Now I know not to go much below 1 KB because it is
> > pointless. I doubt I'd ever need to use anything even remotely
> > approaching 1 MB for my application. I was thinking about using a size
> > between 1 KB and 4 KB, but wasn't sure if the minimum block size might
> > have been something quite a bit larger, like 64 KB.
>
> Well... I don't know. You gain performance from downloading many files
> at once (don't go completely over the top though... the splitfile
> downloader uses around 10)

Doesn't this depend entirely on the limits on the number of threads and 
concurrent connections, as set in the configuration file? And on the hardware 
and network resources, of course.

> - but it means you insert more data, and you
> have to decode it; it's not designed for such small chunks, but we know
> it can work with sizes close to that from work on streaming... The
> overheads on a 1kB CHK are significant (something like 200 bytes?), I'd
> use 4kB chunks, at least...

Is this overhead included in the amount of space consumed? I.e. does a 1 KB 
file plus 200 bytes of overhead end up occupying 2 KB of storage? Or is the 
overhead completely separate?

Is the overhead of 200 bytes fixed for all file sizes, or does it vary with 
the file size?
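To put my question another way, here is the rough arithmetic I am doing, 
assuming the overhead really is a flat ~200 bytes per CHK (which is exactly the 
assumption I am asking you to confirm):

OVERHEAD = 200  # bytes per CHK -- assumed flat, which may be wrong

for chunk_kb in (1, 2, 4, 64, 1024):
    chunk = chunk_kb * 1024
    print(f"{chunk_kb:>5} KB chunks: overhead is {OVERHEAD / chunk:.1%} of the payload")

Which gives roughly 20% at 1 KB and 5% at 4 KB, and is presumably why you 
suggest 4 KB rather than 1 KB, if I have the model right.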

> > > > The reason for this is that I am trying to design a database
> > > > application that uses Freenet as the storage medium (yes, I know about
> > > > FreeSQL, and it doesn't do what I want in the way I want it done).
> > > > Files going missing are an obvious problem that needs to be tackled.
> > > > I'd like to know what the block size is in order to implement
> > > > redundancy padding in the data by exploiting the overheads produced
> > > > by the block size, when a single item of data is smaller than the
> > > > block that contains it.
>
> You do know that Freenet is lossy, right? Content which is not accessed
> very much will eventually expire.

Yes, this is why I am thinking about using DBR. There would potentially be a 
number of nodes that would, once per day, retrieve the data, compact it into 
bigger files, and re-insert it for the next day. This would be equivalent to 
the VACUUM (PostgreSQL) and OPTIMIZE (MySQL) commands.

The daily operation would involve inserting many small files of data (one 
file per record in a table, one file per delete flag, etc.).

This would all be gathered, compacted, and re-inserted. Any indices would also 
get re-generated in the same way.
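For what it's worth, the compaction step itself is simple; below is a rough 
sketch of the sort of packing I have in mind (the record sizes and the 4 KB 
target are my own assumptions, and the actual Freenet fetch/insert calls are 
left out entirely):

def pack_records(records, target_size=4 * 1024):
    # Greedily pack small serialized records into files of roughly target_size.
    files, current, used = [], [], 0
    for rec in records:
        if current and used + len(rec) > target_size:
            files.append(b"".join(current))
            current, used = [], 0
        current.append(rec)
        used += len(rec)
    if current:
        files.append(b"".join(current))
    return files

# e.g. a day's worth of ~100-byte records, one file each before compaction...
records = [f"record-{i:04d}".encode().ljust(100, b".") for i in range(1000)]
# ...becomes ~25 files of ~4 KB after the daily pass.
print(len(pack_records(records)))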

> > > > This could be optimized out in run-time to make no impact on
> > > > execution speed (e.g. skip downloads of blocks that we can
> > > > reconstruct from already downloaded segments).
> > >
> > > Hmm. Not sure I follow.
> >
> > A bit like a Hamming code, but allowing random access. Because it is the
> > latency plus the download that is slow, downloading fewer files is a good
> > thing for performance, so I can reconstruct some of the pending segments
> > rather than downloading them. Very much like FEC, in fact. :-)
>
> Latency is slow. Downloading many files in series is slow. Downloading
> many files in parallel, as long as you don't get bogged down waiting for
> the last retry on the last failing block in a non-redundant splitfile,
> is relatively fast. By all means use your own codes!

I haven't decided what to use for redundancy yet. My biggest reason for using 
my own method is that it would allow me to pad files to a minimal sensible 
size (I was thinking about 4 KB), and enable me to skip chunks that are not 
needed, or back-track to reconstruct a "hole" in the data from the files that 
are already there.

But FEC is very appealing because it already does most of that, so there would 
be less work involved in the implementation of my application.
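To illustrate the kind of home-grown redundancy I mean, here is a minimal 
sketch: one XOR parity block per group of padded data blocks, so a single 
missing block can be rebuilt from the ones already downloaded instead of being 
fetched again. (Real FEC codes obviously do far more than this; it is only 
meant to show the "back-track to fill a hole" idea, and the 4 KB block size is 
just my padding guess.)

BLOCK = 4 * 1024  # the padded block size I mentioned above

def xor_blocks(blocks):
    # XOR equal-sized blocks together byte by byte.
    out = bytearray(BLOCK)
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [bytes([n]) * BLOCK for n in range(1, 5)]  # four padded data blocks
parity = xor_blocks(data)                         # inserted alongside them

# Later: suppose data[1] failed to download, but the rest (and parity) arrived.
rebuilt = xor_blocks([data[0], data[2], data[3], parity])
assert rebuilt == data[1]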

> > Of course, I might not bother if I can use FEC for it instead, provided
> > it will work with very small file sizes (question I asked above).
>
> Well...
>
> FEC divides the file into segments of up to 128 chunks (I think).
> It then creates 64 check blocks for the 128 chunks (obviously fewer if
> fewer original chunks), and inserts the lot, along with a file
> specifying the CHKs of all the different chunks inserted for each
> segment.

Doesn't that mean that with a maximum chunk size of 1 MB, this limits the file 
size to 128 MB? Or did I misunderstand the maximum chunk size, and it is 
purely a matter of caching as a factor of the store size?

What is the smallest file size with which FEC can be used sensibly? Would I be 
correct in guessing this is 2 KB, to create two 1 KB chunks with one 1 KB 
check block?

Is FEC fixed at 50% redundancy, or can the amount of redundancy be controlled 
(e.g. reduced to 25%, if requested)? Or has it been tried and tested that 
around 50% gives the best results?
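To make my reading of the scheme concrete: I am interpreting "segments of up to 
128 chunks, 64 check blocks each" as below, with proportionally fewer check 
blocks for a short final segment, and with larger files simply using more 
segments rather than hitting a 128 MB ceiling. Please correct me if the real 
splitfile code works differently.

from math import ceil

def describe_splitfile(file_size, chunk_size=1024 * 1024):
    # My interpretation only: 128 data chunks per segment, ~50% check blocks.
    data_chunks = ceil(file_size / chunk_size)
    segments = ceil(data_chunks / 128)
    check_blocks = sum(ceil(min(128, data_chunks - s * 128) / 2)
                       for s in range(segments))
    print(f"{data_chunks} data chunks of {chunk_size} B -> {segments} segment(s), "
          f"{check_blocks} check block(s) (~{check_blocks / data_chunks:.0%} redundancy)")

describe_splitfile(200 * 2**20)                # a 200 MB file: just two segments?
describe_splitfile(2 * 1024, chunk_size=1024)  # the 2 KB case I ask about above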

Thanks.

Gordan
