On Thu, Jan 23, 2003 at 04:54:37PM +0000, Gordan Bobic wrote:
> On Thu, 23 Jan 2003, Matthew Toseland wrote:
> 
> > On Thu, Jan 23, 2003 at 11:31:25AM +0000, Gordan Bobic wrote:
> > > Hi.
> > > 
> > > Is there such a thing as chunk size in the way Freenet deals with storing 
> > > and transferring the data?
> > 
> > Freenet key contents are normally some fields plus a power-of-two-sized
> > block of data. Minimum 1kB, I believe.
> 
> Is there a level of segmentation performed by default? Say, if the file is 
> 65KB, will it be stored as a single 65KB file, and always accessed as a 
> SINGLE entity, or will it be segmented by the network into multiple 
> power-of-two-sized chunks (minimum 1 KB)?
All segmentation is done by client software; the node and the network do not
know and do not care which files are splitfiles. As for defaults, Fproxy
inserts files of 1MB or more as redundant splitfiles, and anything smaller
as a single key.
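For illustration only, here is that default policy as a few lines of Python;
the names and the threshold constant are mine, not Fproxy's actual code:

    # Sketch of the default client-side insert decision described above.
    # Hypothetical names; this is not Fproxy's real code.
    SPLITFILE_THRESHOLD = 1 * 1024 * 1024  # 1MB

    def insert_strategy(file_size_bytes):
        if file_size_bytes >= SPLITFILE_THRESHOLD:
            return "redundant FEC splitfile"
        return "single key"

    print(insert_strategy(65 * 1024))        # single key
    print(insert_strategy(5 * 1024 * 1024))  # redundant FEC splitfile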
> 
> > > By chunk size, I mean a minimal unit of data that is worked with. For 
> > > example, for disk access, this would be the size of the inode.
> > 
> > The other interesting parameter here is the default datastore size.
> > Freenet nodes should generally not have a store less than a little over
> > 200MB. The default is 256MB. The maximum size chunk that a datastore
> > will store is 1/200th of the store size. Hence, a file split into 1MB
> > chunks (or less) will be cacheable on all nodes it goes through.
> 
> That doesn't bother me, because I am trying to optimize things down to a 
> _minimum_ sensible file size, rather than maximum possible file size.
> 
> Are files never separated into segments, unless FEC is used? What are the 
> minimum and maximum sizes for FEC segments?
It is possible to insert non-redundant splitfiles. They are unreliable
and slow. FEC splitfiles use "chunks" (segments are something else :)).
Fproxy uses 256kB to 1MB chunks, but other clients could use other sizes;
that is, however, the recommended range for most uses.
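If it helps, the chunk count is just a ceiling division (rough sketch,
ignoring metadata):

    import math

    def chunk_count(file_size_bytes, chunk_size_bytes):
        # Number of chunks a client-side splitter would produce.
        return math.ceil(file_size_bytes / chunk_size_bytes)

    print(chunk_count(5 * 1024 * 1024, 256 * 1024))   # 20 chunks at 256kB
    print(chunk_count(5 * 1024 * 1024, 1024 * 1024))  # 5 chunks at 1MB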
> 
> > > I am trying to determine what is the optimal size to split data into. The 
> > > size I am looking for is the one that implies that (file size) == (block 
> > > size), so that if a block gets lost, the whole file (not just a part of 
> > > it) is gone.
> > 
> > Hmm. Use a power of two. A file and a block are no different on
> > Freenet; when a full datastore needs more space, we delete the files
> > that reach the end of the LRU.
> 
> I understand that. I just wanted to know if files are always treated as a 
> single entity (except concerning FEC), or if they were always segmented 
> into blocks, and different blocks could come from different nodes (again, 
> without using FEC).
Keys with data in them are single entities on the node and on the
network; neither knows anything whatsoever about splitfiles.

Splitfiles therefore can fail if too many of the chunks are no longer
fetchable.
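A back-of-the-envelope way to see why non-redundant splitfiles are fragile
(the 95% per-chunk figure below is made up for illustration):

    # A non-redundant splitfile needs every chunk, so its chance of being
    # retrievable is (per-chunk availability) ** (number of chunks).
    def splitfile_availability(per_chunk_p, num_chunks):
        return per_chunk_p ** num_chunks

    print(splitfile_availability(0.95, 10))   # ~0.60
    print(splitfile_availability(0.95, 100))  # ~0.006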
> 
> > However if you need to store
> > chunks of more than a meg, you need to use redundant (FEC) splitfiles.
> 
> As I said, I was looking for the limits on the small size, rather than the 
> large size. Now I know not to go much below 1 KB because it is pointless. 
> I doubt I'd ever need to use anything even remotely approaching 1 MB for 
> my application. I was thinking about using a size between 1 KB and 4 KB, 
> but wasn't sure if the minimum block size might have been something 
> quite a bit larger, like 64 KB.
Well... I don't know. You gain performance from downloading many files
at once (don't go completely over the top, though... the splitfile
downloader uses around 10), but it means you insert more data, and you
have to decode it. It's not designed for such small chunks, though we
know it can work with sizes close to that from work on streaming... The
overheads on a 1kB CHK are significant (something like 200 bytes?), so
I'd use 4kB chunks, at least...
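To put the ~200 byte figure in perspective (it is approximate, so treat
these percentages as rough):

    # Per-CHK overhead as a fraction of the chunk, assuming ~200 bytes each.
    OVERHEAD = 200  # bytes, approximate

    for chunk in (1024, 4096, 32 * 1024):
        print(chunk, "byte chunks:", round(100 * OVERHEAD / chunk, 1), "% overhead")
    # 1024 byte chunks:  ~19.5% overhead
    # 4096 byte chunks:  ~4.9% overhead
    # 32768 byte chunks: ~0.6% overhead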
> 
> > > The reason for this is that I am trying to design a database application 
> > > that uses Freenet as the storage medium (yes, I know about FreeSQL, and it 
> > > doesn't do what I want in the way I want it done). Files going missing are 
> > > an obvious problem that needs to be tackled. I'd like to know what the 
> > > block size is in order to implement redundancy padding in the data by 
> > > exploiting the overheads produced by the block size, when a single item of 
> > > data is smaller than the block that contains it.
You do know that Freenet is lossy, right? Content which is not accessed
very much will eventually expire.
> > 
> > Cool. See above.
> 
> When you say powers of two - does that mean that a 5 KB file will be 
> rounded to 8 KB? Or be split into 4 KB and 1 KB?
Rounded.
> If it is split, at what point is it split, and how is this handled on the 
> storage and network transfer levels?
A 5MB file would be split, but because of redundancy you'd end up
inserting ~ 7.5MB worth of data anyway :).
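That figure comes straight from the redundancy ratio of the FEC scheme
described further down (64 check blocks per 128 data blocks), ignoring
metadata:

    # ~1.5x expansion: 64 check blocks per 128 data blocks.
    data_mb = 5
    redundancy = (128 + 64) / 128  # = 1.5
    print(data_mb * redundancy)    # 7.5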
> 
> > > This could be optimized out in run-time to make no impact on execution 
> > > speed (e.g. skip downloads of blocks that we can reconstruct from already 
> > > downloaded segments).
> > 
> > Hmm. Not sure I follow.
> 
> A bit like a Hamming code, but allowing random access. Because it is 
> latency + download that is slow, downloading fewer files is a good 
> thing for performance, so I can reconstruct some of the pending segments 
> rather than downloading them. Very much like FEC, in fact. :-)
Latency is slow. Downloading many files in series is slow. Downloading
many files in parallel, as long as you don't get bogged down waiting for
the last retry on the last failing block in a non-redundant splitfile,
is relatively fast. By all means use your own codes!
> 
> Of course, I might not bother if I can use FEC for it instead, provided it 
> will work with very small file sizes (question I asked above).
Well...

FEC divides the file into segments of up to 128 chunks (I think).
It then creates 64 check blocks for the 128 chunks (obviously fewer if
there are fewer original chunks), and inserts the lot, along with a file
specifying the CHKs of all the different chunks inserted for each
segment.

The client then fetches the file containing all the CHKs, and randomly
fetches blocks from the 192 available, until it has any 128 blocks, or
it runs out of blocks to (re)try.

We cannot reconstruct ANY of the segment until we have 128 blocks. But
when we have any 128 blocks, we can reconstruct the original 128, and
the check blocks. Hence, we have recently introduced automatic
reinsertion of a few of the failed blocks, provided we got the whole
file eventually but a few blocks failed along the way.

We use the Onion Networks FEC code.

I would be quite interested in alternative codes.
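If you want a feel for how much the redundancy buys you, here is a toy
simulation of the fetch behaviour described above (this is not the Onion
Networks code, and the 80% per-block success rate is invented):

    import random

    DATA_BLOCKS, CHECK_BLOCKS = 128, 64
    TOTAL = DATA_BLOCKS + CHECK_BLOCKS  # 192 blocks per full segment

    def segment_decodes(per_block_success=0.8):
        # The segment decodes once ANY 128 of the 192 blocks are fetched.
        fetched = sum(random.random() < per_block_success for _ in range(TOTAL))
        return fetched >= DATA_BLOCKS

    trials = 10000
    ok = sum(segment_decodes() for _ in range(trials))
    print("FEC segment decoded in", ok, "of", trials, "trials")
    print("non-redundant 128-chunk file survives with prob", 0.8 ** 128)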

The node, and the network, work on the basis of fetching, and caching,
whole "files" (keys, not the files that get split up). Although a node
streams them, they are cached as atomic blocks: they are not available
to other requests until the whole key has transferred. It is not
possible to fetch part of a key, only the whole key. Obviously you can
do what you like with the data once you have it.
> 
> Thanks.
> 
> Gordan
> 

-- 
Matthew Toseland
[EMAIL PROTECTED]
Full time freenet hacker.
http://freenetproject.org/
Freenet Distribution Node (temporary) at http://amphibian.dyndns.org:8889/I3mGXPd6zTA/
ICTHUS.
