Re: Filesystem holes

2000-10-31 Thread John Summerfield


> > :> actually being used, while providing instant lookups.
> > :> 
> > :> The single file would be about 96G addressable bytes... But the actual
> > :> block count would be much lower.  I suppose I will have to create a series
> > :> of these files and divide the problem into < 4GB chunks, but one
> > :> lookup/store will still be done with one calculation and one disk seek
> > :> (just more filehandles open).
> > :> 
> > :> Deletes seem problematic.  My question is, does the operating system
> > :> provide any way to free blocks used in the middle of a file?
> > :> 
> > :> Must I periodically re-create these files (long and slow process, but not
> > :> so bad if done infrequently) to reclaim space, or is there a way to free
> > :> arbitrary blocks in a file in userland (WITHOUT trashing the filesystem?
> > :> :-)
> > :> 
> > :> - Ryan
> > :> 
> > :> -- 
> > :>   Ryan Thompson <[EMAIL PROTECTED]>
> > :>   Network Administrator, Accounts
> > :>   Phone: +1 (306) 664-1161
> > :> 
> > :>   SaskNow Technologies http://www.sasknow.com
> > :>   #106-380 3120 8th St E   Saskatoon, SK  S7H 0W2
> > 
> > I would strongly recommend using a B+Tree instead of a hash table.  With
> > a hash table slightly different lookups will seek all over the place.
> > With a B+Tree lookups will stay more localized.
> 
> Right... That's a good point, but (and, no, I really didn't mention this
> in my original post), "sequential" access is never important with the
> system I have described.  Lookups are always random, and they are almost
> always done one at a time (multiple lookups only done for data swaps...
> very rare, like 1/1e7 accesses... and those are still O(1) (well,
> O(2), but who's counting ;-)))...


> Remember, though, I'm not proposing a hash table.  I have been lucky
> enough to design a "hash" function that will crunch down all possible
> values into about 64M unique keys--so, no collisions.  I propose to use
> this function with address calculation.  So, O(1) everything, for single
> random lookups, which is (virtually) all this data structure must deal
> with.  Sorting isn't necessary, and that could be accomplished in O(n)
> anyway.
> 


A way to use a hash for variable-length records:

Use an intermediate 'file' whose records are simply pointers to the location 
of the real data.

Finding a record goes like this:

ACnumber=getRecord("KEY");
realRecordNumber=AC(ACnumber);

then read record number realRecordNumber.

The AC idea I pinched from Adabas, which (back in the early 80s, dunno what it does 
now) had an address converter. It changed record numbers (returned by getRecord() 
here) to a disk address. Its size is n, where n is the number of records you 
support.
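
In C, that two-step lookup might look like the sketch below.  The names
(getRecord, AC) follow the description above; the sizes and the stand-in
hash are illustrative, not the real Adabas interface.

#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>

#define NRECORDS  1024          /* "n, where n is the number of records you support" */
#define BLOCKSIZE 8192

static uint32_t AC[NRECORDS];   /* address converter: record number -> data-storage block */

/* Stand-in for the key lookup; a real system would search an index. */
static uint32_t getRecord(const char *key)
{
    uint32_t h = 0;
    while (*key)
        h = h * 31 + (unsigned char)*key++;
    return h % NRECORDS;
}

int main(void)
{
    AC[getRecord("KEY")] = 42;                  /* pretend the record lives in block 42 */

    uint32_t ACnumber = getRecord("KEY");       /* record number */
    off_t where = (off_t)AC[ACnumber] * BLOCKSIZE;

    printf("seek to %lld, then scan that block for the record\n",
           (long long)where);
    return 0;
}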

In Data Storage (another Adabas term), you have fixed-length blocks, each 
containing several records.

Adabas stored the Adabas record number and the data fields; numbers were stored as 
binary, character fields (usually) preceded by a (one-byte) length so 
trailing spaces could be dropped.

Inserted records go into empty blocks (except in load mode where blocks are 
filled to a user-specified percentage) (and presumably into a block in the 
buffer pool with space).

Blocks are compressed before being rewritten - all the data is packed to the 
front. The first field in the DS block is the offset to free space in the 
block.

Adabas also had a work dataset, where data are recorded pending end of 
transaction; EOT can be acknowledged when the information is safely recorded in 
work, as Adabas can always recover from there. It also has transaction logging, 
either to tape (alternating between two drives) or to two files on disk.




Re: Filesystem holes

2000-10-31 Thread Matt Dillon

:In my case I'd be better off with shared memory objects that aren't
:persistent but appear in the name space so that I don't accidentally
:start copying a virtual bus file when the programs exit improperly.
:In the sparse matrix calculations with no checkpointing or need to appear
:in a name space I'd think the best thing would be to use VM with the matrix
:initially mapped to a copy on write zero page.  I guess you can't
:do that without mmap because of swap allocation.
:
:Peter
:
:--
:Peter Dufault ([EMAIL PROTECTED])   Realtime development, Machine control,

If you can fit the whole thing into a process's VM space, then you
can certainly use mmap(...MAP_ANON...) combined with madvise(... MADV_FREE)
to implement a sparse memory object.
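
A minimal sketch of that combination, assuming a platform (such as
FreeBSD) that provides MAP_ANON and MADV_FREE; the sizes are arbitrary:

#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

#define OBJSIZE (1UL << 30)     /* 1GB of address space, unbacked until touched */
#define PAGE    4096

int main(void)
{
    char *obj = mmap(NULL, OBJSIZE, PROT_READ | PROT_WRITE,
                     MAP_ANON | MAP_PRIVATE, -1, 0);
    if (obj == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(obj + 100 * PAGE, 1, PAGE);  /* touching a page allocates it */

    /* Done with that region: let the VM system reclaim the pages.  The
     * address range stays mapped and reads back as zero-fill later. */
    if (madvise(obj + 100 * PAGE, PAGE, MADV_FREE) == -1)
        perror("madvise");

    munmap(obj, OBJSIZE);
    return 0;
}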

There was talk a while ago about feeding MADV_FREE through to the
filesystem layer.  I was under the impression that Sun does that; does
anyone know?

-Matt





Re: Filesystem holes

2000-10-31 Thread Matt Dillon

:to hold a write lock on the range you didn't want rewritten; so
:long as it honors the advisory locks, there'd be no chance of it
:screwing up, unless you got bit by the stupid POSIX lock close
:semantics.  Stupid POSIX; that's the other one I'd put in: the
:ability to:
:
:   int i = 1;
:
:   fcntl( fd, F_NONPOSIX, &i);
:
:It would help out the NFS locking daemon to no end...
:
:   Terry Lambert
:   [EMAIL PROTECTED]

We could implement this trivially, and I'm pretty sure we could
implement some sort of free-space semantics trivially too, at least
for UFS, using a struct flock to pass the parameters to a fcntl.

-Matt






Re: Filesystem holes

2000-10-31 Thread Peter Dufault

> > I use them for bus simulations, which also are permanently sparse.
> > It would be nice to free up the regions when I "remove" a virtual
> > board, but in a check through POSIX I could find nothing defined to
> > behave that way either for mapped files or mapped memory objects.
> 

(...)

> 
> POSIX was unwilling to take a stand on the F_FREESP vs. ftruncate
> war, and so they remained agnostic, and didn't document anything.

No, IIRC POSIX defines ftruncate behavior both for mapped files
and shared memory objects but that isn't much
use for freeing up holes unless you want to repopulate chunks after
the hole.  All in all I agree with Matt about the suitability
of large sparse files for data storage.  My case is different in that
I want to test object code in pretty much its final form and legacy
code will be full of brute force address calculations.

...

> Peter: for your rewriting problem, I think you could just decide
> to hold a write lock on the range you didn't want rewritten; so
> long as it honors the advisory locks, there'd be no chance of it
> screwing up, unless you got bit by the stupid POSIX lock close
> semantics.  Stupid POSIX; that's the other one I'd put in: the
> ability to:

I never thought of that.  I'll look at it.  I'll have to see how
it is defined in POSIX and how it plays with shared memory objects
on Solaris.  Currently the only difference in my code is that I
define shm_open to be (errno = ENOSYS, -1) if shared memory objects
aren't supported; I then fall back to mapped files and it all works
well.  One unfortunate thing, though, is that Solaris requires that
shared memory objects have the name of a file in "/", where I
typically don't want to place multi-gigabyte files even when they
are allegedly sparse.  That forces the shared memory objects and the
mapped files into different places, with names such as
"/vme_pid7662_data64" since I can't have subdirs.
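
In code the fallback looks roughly like this (a sketch only;
HAVE_SHM_OPEN, the helper name and the example paths are placeholders):

#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef HAVE_SHM_OPEN
#define shm_open(name, oflag, mode)  (errno = ENOSYS, -1)
#endif

void *map_bus(const char *shmname, const char *filename, off_t size)
{
    int fd = shm_open(shmname, O_RDWR | O_CREAT, 0600);
    if (fd == -1 && errno == ENOSYS)
        fd = open(filename, O_RDWR | O_CREAT, 0600);   /* plain mapped file */
    if (fd == -1)
        return MAP_FAILED;

    (void)ftruncate(fd, size);      /* extends sparsely; no blocks allocated */
    void *p = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping keeps the object usable */
    return p;
}

/* e.g. map_bus("/vme_pid7662_data64", "/var/tmp/vme_data64", (off_t)1 << 36) */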

Peter

--
Peter Dufault ([EMAIL PROTECTED])   Realtime development, Machine control,
HD Associates, Inc.   Fail-Safe systems, Agency approval





Re: Filesystem holes

2000-10-31 Thread Terry Lambert

> I use them for bus simulations, which also are permanently sparse.
> It would be nice to free up the regions when I "remove" a virtual
> board, but in a check through POSIX I could find nothing defined to
> behave that way either for mapped files or mapped memory objects.

SVR4 defined F_FREESP; if passed to fcntl, it will call the
VOP_SPACE routine.  The third argument is a struct flock, and
it used to be that it only supported l_len == 0, but that
changed in 1993/1994, about the same time we got sfork
support flogged into it past the USL Process (caps intentional).

1993 was too late to get either interface fully documented in
"The Magic Garden Explained", unfortunately, but it's been in
SVR4 (along with sfork, via an ioctl against /proc), since
back then.

POSIX was unwilling to take a stand on the F_FREESP vs. ftruncate
war, and so they remained agnostic, and didn't document anything.

FWIW, the SVR4 interface is better, since it allows you to open
holes.  It was surprisingly hard to get simple changes like this
past the keepers of the keys to the USL source tree.  After we
found out why, it became more understandable: the USL source code
is built like a trinary nerve gas weapon, and you have to touch
all three parts to get a change into the code.  It's really rather
arranged So That Things Will Not Change.

My preference would be to hook it into fcntl, with F_FREESP, with
the extended interface that can do more than truncate.  This would
require adding a "VOP_SPACE" type thing.
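
On a system that has it (SVR4/Solaris), the call would look roughly like
the sketch below; the field conventions follow the SVR4/Solaris
documentation, and F_FREESP is guarded since FreeBSD does not define it.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int punch_hole(int fd, off_t start, off_t len)
{
#ifdef F_FREESP
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_whence = SEEK_SET;     /* interpret l_start from the file start */
    fl.l_start  = start;        /* first byte of the region to free */
    fl.l_len    = len;          /* originally only l_len == 0 ("to EOF") worked */

    return fcntl(fd, F_FREESP, &fl);    /* frees the blocks, leaving a hole */
#else
    (void)fd; (void)start; (void)len;
    fprintf(stderr, "F_FREESP not available here\n");
    return -1;
#endif
}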

There was also an undocumented F_ALLOCSP; I don't remember if that
got folded in with the rest of the code, or if it got left out,
but it does the obvious thing: allocates backing pages.

Peter: for your rewriting problem, I think you could just decide
to hold a write lock on the range you didn't want rewritten; so
long as it honors the advisory locks, there'd be no chance of it
screwing up, unless you got bit by the stupid POSIX lock close
semantics.  Stupid POSIX; that's the other one I'd put in: the
ability to:

int i = 1;

fcntl( fd, F_NONPOSIX, &i);

It would help out the NFS locking daemon to no end...
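
The F_NONPOSIX call is only a proposal; the part that exists today,
holding an advisory write lock on just the byte range you don't want
rewritten, is ordinary fcntl locking:

#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int lock_range(int fd, off_t start, off_t len)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type   = F_WRLCK;      /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start  = start;
    fl.l_len    = len;          /* 0 would mean "to end of file" */

    /* F_SETLKW waits until the range is free; any process that honors
     * the advisory lock will then leave that range alone. */
    return fcntl(fd, F_SETLKW, &fl);
}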


Terry Lambert
[EMAIL PROTECTED]
---
Any opinions in this posting are my own and not those of my present
or previous employers.





Re: Filesystem holes

2000-10-31 Thread Peter Dufault

> ...  Sparse matrixes are the big math problem
> that benefit, but only because the solution to a sparse matrix problem
> is not even close to random so the sparse matrix winds up still being
> sparse all the way to the end of the solution.

I use them for bus simulations, which also are permanently sparse.
It would be nice to free up the regions when I "remove" a virtual
board, but in a check through POSIX I could find nothing defined to
behave that way either for mapped files or mapped memory objects.
Also, a write from any process would repopulate the region, which
I really wouldn't want, but I don't see that level of control
over mapping between unrelated processes.  (Now I start thinking
about MAP_FIXED to a specified virtual address and implementing funny
tricks but I don't have time to work on that).

In my case I'd be better off with shared memory objects that aren't
persistent but appear in the name space so that I don't accidentally
start copying a virtual bus file when the programs exit improperly.
In the sparse matrix calculations with no checkpointing or need to appear
in a name space I'd think the best thing would be to use VM with the matrix
initially mapped to a copy on write zero page.  I guess you can't
do that without mmap because of swap allocation.

Peter

--
Peter Dufault ([EMAIL PROTECTED])   Realtime development, Machine control,
HD Associates, Inc.   Fail-Safe systems, Agency approval





Re: Filesystem holes

2000-10-31 Thread Matt Dillon


:
:> Ahh.. yes, I know.  I'm a filesystem expert :-)  However, that said, I
:> will tell you quite frankly that virtually *nobody* depends on holes
:> for efficient storage.  There are only a few problems where it's
:> practical some forms of executables, and sparse matrixes.  That's
:> pretty much it.
:
:Your master password database.  Most sendmail maps.  Anything
:else that uses the Berkeley DB, like message catalog files,
:locales, etc..

Actually less than you think.  Even though DB utilizes the concept
of a sparse file, if you check carefully you will note that your
sendmail maps and password file databases aren't actually all
that sparse.  With an 8K filesystem block size the default DB
multiplier usually results in virtually no sparseness at all.  It
takes tuning to get any sort of sparseness and even then you don't
get much benefit from it.  The only real benefit is in the hash table
size factor ... the hash array may be sparse, but the physical file
underlying it probably isn't.

Not even USENET news history files, which can run into the gigabytes,
actually wind up being that sparse.

Also keep in mind that Berkeley DB is very old, even with all the
rewrites, and the original hash algorithm was chosen for expediency
rather than for the 'best' efficiency.  Our current DB library has
a btree access method, and for big data sets it works a whole lot better
than the hash method.  It doesn't require tuning, for one.

:Frankly, sparse files have a huge number of uses, particularly
:when applied to persistant storage of data of the kind you'd
:see in chapter 5, section 5.4.x and chapter 6 in Knuth's.
:
:Plus your FFS filesystem itself is a sparse matrix.  It'd be
:real useful to be able to "free up holes" in a file, if I
:wanted to use one to do user space work on an FS design, for
:example, a log structured FS, where I wanted to be able to
:experiment with a "cleaner" process that recovered extents.
:
:I'd actually be able to tell real quickly whether it was
:working by just setting an allocation range that I expect
:my iterative testing to stay within (if it goes over or under
:the range while I'm moving stuff around and cleaning at the
:same time, I'll know there's a bug in my daemon).
:
:Personally, I'm not rich enough to be able to burn disk space
:so easily.
:   Terry Lambert
:   [EMAIL PROTECTED]

I agree that sparse files should not be discarded out of hand.
There are very few real problems that need them, though.  I can
think of a handful, all very specialized; for example, the
VN device uses the concept of a sparse-file (actually a sparse
backing for the filesystem layer), as do Kirk's softupdates
snapshots (I believe).  Sparse matrixes are the big math problem
that benefit, but only because the solution to a sparse matrix problem
is not even close to random so the sparse matrix winds up still being
sparse all the way to the end of the solution.  But these are extremely
specialized problems, not general datastore problems, and nearly
all of these problems are tuned (or inherently) specific to the block
size of the underlying system.  Using a sparse file for general data
store just isn't all that hot an idea, because by its very nature data
store is, well, storing data.  Packing it is usually the way to go.

-Matt






Re: Filesystem holes

2000-10-31 Thread Terry Lambert

> Ahh.. yes, I know.  I'm a filesystem expert :-)  However, that said, I
> will tell you quite frankly that virtually *nobody* depends on holes
> for efficient storage.  There are only a few problems where it's
> practical some forms of executables, and sparse matrixes.  That's
> pretty much it.

Your master password database.  Most sendmail maps.  Anything
else that uses the Berkeley DB, like message catalog files,
locales, etc..

Frankly, sparse files have a huge number of uses, particularly
when applied to persistent storage of data of the kind you'd
see in chapter 5, section 5.4.x and chapter 6 in Knuth's.

Plus your FFS filesystem itself is a sparse matrix.  It'd be
real useful to be able to "free up holes" in a file, if I
wanted to use one to do user space work on an FS design, for
example, a log structured FS, where I wanted to be able to
experiment with a "cleaner" process that recovered extents.

I'd actually be able to tell real quickly whether it was
working by just setting an allocation range that I expect
my iterative testing to stay within (if it goes over or under
the range while I'm moving stuff around and cleaning at the
same time, I'll know there's a bug in my daemon).

Personally, I'm not rich enough to be able to burn disk space
so easily.


Terry Lambert
[EMAIL PROTECTED]
---
Any opinions in this posting are my own and not those of my present
or previous employers.





Re: Filesystem holes

2000-10-30 Thread Matt Dillon


:The original question still stands, and I'm quite interested in hearing
:an answer.
:
:I think Ryan's looking for an equivalent to Solaris' F_FREESP fcntl; I'm
:not aware that one exists in FBSD - right?
:
:jan
:
:-- 
:jan grant, ILRT, University of Bristol. http://www.ilrt.bris.ac.uk/

No, we don't have anything like that, though our VFS layer could
support it.

-Matt





Re: Filesystem holes

2000-10-30 Thread Jan Grant

The original question still stands, and I'm quite interested in hearing
an answer.

I think Ryan's looking for an equivalent to Solaris' F_FREESP fcntl; I'm
not aware that one exists in FBSD - right?

jan

-- 
jan grant, ILRT, University of Bristol. http://www.ilrt.bris.ac.uk/
Tel +44(0)117 9287163 Fax +44 (0)117 9287112 RFC822 [EMAIL PROTECTED]
stty intr ^m






Re: Filesystem holes

2000-10-29 Thread Matt Dillon


:Yes, there's a high probability of that.  It's one of the reasons 
:why people typically use the feature, at least not for 'permanent'
:data sets.

Ahhh... of course I meant 'typically do not use the feature'.  Heh.

-Matt





Re: Filesystem holes

2000-10-29 Thread Ryan Thompson

Leif Neland wrote to Ryan Thompson and Matt Dillon:

> 
> What will happen, if somebody (possibly you, as mahordomo says), tries to
> make a backup of that file.

Make sure to use a program that can cope ;-)


> Will the copy also be with holes, or would that file suddenly use all 96GB?
> It will at least do so, if one does cat file>file.bak
> Probably tar will do the same.

Actually, tar will handle holes elegantly (this I have tested), with the
--sparse option.  Older versions would not.  cat and other similar
"filters" are naive, as they simply block I/O.

Backing up with tar and/or a filesystem dump would be just as effective as
with any other storage strategy.

cat file > file.bak on even a 2GB file is probably not something that
would be popular, anyway.

> I'd be afraid to create something which could easily blow up by having
> normal operations applied to it.

That's a valid concern.  That's the biggest drawback I see to the overall
strategy... But, specialized problems sometimes encourage specialized
solutions.

> 
> Leif
> 

-- 
  Ryan Thompson <[EMAIL PROTECTED]>
  Network Administrator, Accounts
  Phone: +1 (306) 664-1161

  SaskNow Technologies http://www.sasknow.com
  #106-380 3120 8th St E   Saskatoon, SK  S7H 0W2






Re: Filesystem holes

2000-10-29 Thread Matt Dillon

:
:Presumably writing into these holes repeatedly is a none-too-efficient 
:affair (requiring moving all the stuff on either side), or am I missing the 
:point slightly?
:
:Dave :)

It isn't quite that bad, but it isn't a walk in the park either.
Since data blocks are referenced from a block table, inserting new
blocks is cheap.  However, this means that data may not be physically
ordered in the file -- if your read crosses a block boundary it may require
an extra seek.  FreeBSD's FFS does have the reallocblks feature turned
on and this will cause the filesystem to try to reorder nearby blocks,
but it was never designed to handle sparse files, so...

-Matt






Re: Filesystem holes

2000-10-29 Thread Matt Dillon

:>
:What will happen, if somebody (possibly you, as mahordomo says), tries to
:make a backup of that file.

That will depend on the backup program.  dump/restore can handle
holes just fine.  tar can handle them in a 'fake' way, and you have
to tell it.  Programs like 'cp' cannot handle holes; they'll
copy the zeros.

:I'd be afraid to create something which could easily blow up by having
:normal operations applied to it.
:
:Leif

Yes, there's a high probability of that.  It's one of the reasons 
why people typically use the feature, at least not for 'permanent'
data sets.

-Matt






Re: Filesystem holes

2000-10-29 Thread Ryan Thompson

Matt Dillon wrote to Ryan Thompson:

> :Hi, Matt!  Thanks for the replies.  I'll try and keep you interested ;-)
> :
> :Hmm... Perhaps you're still missing my original point?  I'm talking about
> :a file with 96GB in addressable bytes (well, probably a bunch of files,
> :given logical filesize limitations, but let's say for simplicity's sake
> :that we have a single file).  It's actual "size" (in terms of allocated
> :blocks) will be only a bit larger than 2.1GB.  (Directly proportional to
> :the used size of the dataset.  Discrepancies only come into play when
> :record size != block size, but that can be worked around somewhat)
> 
> Ah, ok... that's better, though I think you will find yourself
> tuning it endlessly if the blocksize does not match-up.  Remember,
> the filesystem allocates 8K per block, so if your record size is
> 1500 bytes and you have a random distribution, you are going to
> wind up eating 8-14GB.

True.. As they say, "Disk is cheap", though... And "tuning" isn't so bad
on a simple address calculation.  I agree that if the record size doesn't
closely match the blocksize, there will be a waste of space, but at least
that waste is proportional to the dataset... Better than many data
structures that I could call by name.  It wouldn't be that difficult with
many problem sets to optimize them to the point where their records align
neatly with the blocksize.  Granted, the waste space would hang you with
some problem sets.  I propose, though, that for some problems, it is easy
enough to tune them to reduce (or, if you're really lucky, eliminate),
waste space.  Yes, this is a specialized solution.


> :In other words, ls -ls will report the "size" as some ridiculously large
> :number, will show a much smaller block count.  So, assuming four records
> :are added to the file on block boundaries, the file will actually only use
> :four blocks... nowhere near 96GB!
> :
> :In the UNIX filesystem (ya, I know.. just pick one :-), size of file !=
> :space allocated for file.  Thus, my original questions were centered
> :around filesystem holes.  I.e., non-allocated chunks in the middle of a
> :file.  When trying to READ from within a hole, the kernel just sends back
> :a buffer of zeros... which is enough to show that the record is not
> :initialized.  Actually, something like an "exists" function for a record
> :wouldn't touch the disk at all!  When writing to a hole, the kernel simply
> :allocates the necessary block(s).  This is really fast, too, for creation,
> :as the empty set can be written to disk with touch(1), and uses far less
> :memory than virtual initialization or memory structures ;-)
> :
> :As an example, try 
> 
> Ahh.. yes, I know.  I'm a filesystem expert :-)  

I know, Matt, and that's why it's great to talk to you about this :-)  I
guess I needed an example to convey my point, though.  I apologize if it
came across as an insult to your intelligence; 'twas not intended that
way in the least.


>   However, that said, I
> will tell you quite frankly that virtually *nobody* depends on holes
> for efficient storage.  

Ahh.  Ok.  This is the kind of response I was looking for.  Case studies
:-)

> There are only a few problems where it's
> practical some forms of executables, and sparse matrixes.  That's
> pretty much it.


> :And analyze the reported filesize, as well as the reported block count of
> :the file.  You should see a 2GB "size", and maybe 8K in allocated blocks
> :(depends how you ran newfs ;-).  This is behaviour that has been around
> :since the original UFS.
> 
> It's not a good idea to use a small block size in UFS.  The minimum
> is pretty much 1K frag, 8K block.  for a sparse file this means 8K
> is the minimum that will be allocated (frags are only allocated for
> the end of a file).

Right, which is why this strategy _does_ only work for a specific subset
of problems, just as a directed graph works well for a path structure, but
would really suck for (among other things) maintaining a sorted list of
account numbers.


> If you think the btree idea is too complex to implement quickly, then
> try using a radix tree (essentially what you said in regards to 
> breaking the hash you calculate into a node traversal).  For your
> problem I would recommend 64 elements per node, which corresponds to
> 6 bits.  16 million records would fit in 6x4 = 4 levels.  If you
> cache the first three levels in memory you eat 64^3 = 262144 x
> sizeof(record).  Assuming a simple file offset for the record, you eat
> exactly 1MB of memory, 64MB for the file index, and your data can
> be placed sequentially in the file.

Re: Filesystem holes

2000-10-29 Thread Leif Neland


> Hmm... Perhaps you're still missing my original point?  I'm talking about
> a file with 96GB in addressable bytes (well, probably a bunch of files,
> given logical filesize limitations, but let's say for simplicity's sake
> that we have a single file).  It's actual "size" (in terms of allocated
> blocks) will be only a bit larger than 2.1GB.  (Directly proportional to
> the used size of the dataset.  Discrepancies only come into play when
> record size != block size, but that can be worked around somewhat)
>
> In other words, ls -ls will report the "size" as some ridiculously large
> number, will show a much smaller block count.  So, assuming four records
> are added to the file on block boundaries, the file will actually only use
> four blocks... nowhere near 96GB!
>
> In the UNIX filesystem (ya, I know.. just pick one :-), size of file !=
> space allocated for file.  Thus, my original questions were centered
> around filesystem holes.  I.e., non-allocated chunks in the middle of a
> file.  When trying to READ from within a hole, the kernel just sends back
> a buffer of zeros... which is enough to show that the record is not
> initialized.  Actually, something like an "exists" function for a record
> wouldn't touch the disk at all!  When writing to a hole, the kernel simply
> allocates the necessary block(s).  This is really fast, too, for creation,
> as the empty set can be written to disk with touch(1), and uses far less
> memory than virtual initialization or memory structures ;-)
>
What will happen if somebody (possibly you, as majordomo says) tries to
make a backup of that file?

Will the copy also be with holes, or would that file suddenly use all 96GB?
It will at least do so, if one does cat file>file.bak
Probably tar will do the same.

I'd be afraid to create something which could easily blow up by having
normal operations applied to it.

Leif








Re: Filesystem holes

2000-10-29 Thread David Preece

At 19:16 29/10/00 -0800, you wrote:

> Ahh.. yes, I know.  I'm a filesystem expert :-)  However, that said, I
> will tell you quite frankly that virtually *nobody* depends on holes
> for efficient storage.  There are only a few problems where it's
> practical some forms of executables, and sparse matrixes.  That's
> pretty much it.

Presumably writing into these holes repeatedly is a none-too-efficient 
affair (requiring moving all the stuff on either side), or am I missing the 
point slightly?

Dave :)







Re: Filesystem holes

2000-10-29 Thread Matt Dillon

:Hi, Matt!  Thanks for the replies.  I'll try and keep you interested ;-)
:
:Hmm... Perhaps you're still missing my original point?  I'm talking about
:a file with 96GB in addressable bytes (well, probably a bunch of files,
:given logical filesize limitations, but let's say for simplicity's sake
:that we have a single file).  It's actual "size" (in terms of allocated
:blocks) will be only a bit larger than 2.1GB.  (Directly proportional to
:the used size of the dataset.  Discrepancies only come into play when
:record size != block size, but that can be worked around somewhat)

Ah, ok... that's better, though I think you will find yourself
tuning it endlessly if the blocksize does not match up.  Remember,
the filesystem allocates 8K per block, so if your record size is
1500 bytes and you have a random distribution, you are going to
wind up eating 8-14GB.

:In other words, ls -ls will report the "size" as some ridiculously large
:number, will show a much smaller block count.  So, assuming four records
:are added to the file on block boundaries, the file will actually only use
:four blocks... nowhere near 96GB!
:
:In the UNIX filesystem (ya, I know.. just pick one :-), size of file !=
:space allocated for file.  Thus, my original questions were centered
:around filesystem holes.  I.e., non-allocated chunks in the middle of a
:file.  When trying to READ from within a hole, the kernel just sends back
:a buffer of zeros... which is enough to show that the record is not
:initialized.  Actually, something like an "exists" function for a record
:wouldn't touch the disk at all!  When writing to a hole, the kernel simply
:allocates the necessary block(s).  This is really fast, too, for creation,
:as the empty set can be written to disk with touch(1), and uses far less
:memory than virtual initialization or memory structures ;-)
:
:As an example, try 

Ahh.. yes, I know.  I'm a filesystem expert :-)  However, that said, I
will tell you quite frankly that virtually *nobody* depends on holes
for efficient storage.  There are only a few problems where it's
practical some forms of executables, and sparse matrixes.  That's
pretty much it.

:And analyze the reported filesize, as well as the reported block count of
:the file.  You should see a 2GB "size", and maybe 8K in allocated blocks
:(depends how you ran newfs ;-).  This is behaviour that has been around
:since the original UFS.

It's not a good idea to use a small block size in UFS.  The minimum
is pretty much 1K frag, 8K block.  For a sparse file this means 8K
is the minimum that will be allocated (frags are only allocated for
the end of a file).

If you think the btree idea is too complex to implement quickly, then
try using a radix tree (essentially what you said in regards to 
breaking the hash you calculate into a node traversal).  For your
problem I would recommend 64 elements per node, which corresponds to
6 bits.  16 million records would fit in 6x4 = 4 levels.  If you
cache the first three levels in memory you eat 64^3 = 262144 x
sizeof(record).  Assuming a simple file offset for the record, you eat
exactly 1MB of memory, 64MB for the file index, and your data can
be placed sequentially in the file. 
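
A sketch of that layout: the 24-bit key is split into four 6-bit digits,
the first three levels are collapsed into one in-memory table, and the
leaf level (file offsets of the records) lives in an index file.  Names
and layout here are illustrative only.

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define FANOUT 64                               /* 2^6 children per node */

/* 64^3 = 262144 slots; each holds the index-file offset of a 64-entry
 * leaf of record offsets (roughly the 1MB figure above, depending on
 * the size of an offset). */
static off_t level3[FANOUT * FANOUT * FANOUT];

/* Return the data-file offset of the record for a 24-bit key, or -1. */
off_t radix_lookup(int idxfd, uint32_t key)
{
    uint32_t top  = (key >> 6) & (FANOUT * FANOUT * FANOUT - 1);  /* digits 1-3 */
    uint32_t slot = key & (FANOUT - 1);                           /* digit 4 */

    off_t leafbase = level3[top];
    if (leafbase == 0)
        return -1;              /* no leaf allocated yet: record absent */

    off_t recoff;
    if (pread(idxfd, &recoff, sizeof recoff,
              leafbase + (off_t)slot * sizeof recoff) != (ssize_t)sizeof recoff)
        return -1;
    return recoff;              /* one seek in the data file from here */
}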

:- Ryan

-Matt





Re: Filesystem holes

2000-10-29 Thread Ryan Thompson

Ryan Thompson wrote to Matt Dillon:

> Matt Dillon wrote to Ryan Thompson:
> 
> > :> :> storage is rather inefficient for our table of about 2,850,000 members
> > :> :> (~2.1 GB total storage).  There are 64M possible hash values in our
> > :> :> current implementation, and our record size is variable, but could be
> > :> :> safely fixed at about 1.5KB... So, total storage if all values were used
> > :> :> would be about 96GB.  (See where I'm going with this?)
> > :...
> > :
> > :Remember, though, I'm not proposing a hash table.  I have been lucky
> > :enough to design a "hash" function that will crunch down all possible
> > :values into about 64M unique keys--so, no collisions.  I propose to use
> > :this function with address calculation.  So, O(1) everything, for single
> > :random lookups, which is (virtually) all this data structure must deal
> > :with.  Sorting isn't necessary, and that could be accomplished in O(n)
> > :anyway.
> > 
> > Are you still talking about creating a 96GB file to manage 2.1GB worth
> > of data?  I gotta say, that sounds kinda nuts to me!  Cpu cycles are
> > cheap, I/O and memory is not.
> 
> Hi, Matt!  Thanks for the replies.  I'll try and keep you interested ;-)
> 
> Hmm... Perhaps you're still missing my original point?  I'm talking about
> a file with 96GB in addressable bytes (well, probably a bunch of files,
> given logical filesize limitations, but let's say for simplicity's sake
> that we have a single file).  It's actual "size" (in terms of allocated
> blocks) will be only a bit larger than 2.1GB.  (Directly proportional to
> the used size of the dataset.  Discrepancies only come into play when
> record size != block size, but that can be worked around somewhat)
> 
> In other words, ls -ls will report the "size" as some ridiculously large
> number, will show a much smaller block count.  So, assuming four records
> are added to the file on block boundaries, the file will actually only use
> four blocks... nowhere near 96GB!
> 
> In the UNIX filesystem (ya, I know.. just pick one :-), size of file !=
> space allocated for file.  Thus, my original questions were centered
> around filesystem holes.  I.e., non-allocated chunks in the middle of a
> file.  When trying to READ from within a hole, the kernel just sends back
> a buffer of zeros... which is enough to show that the record is not
> initialized.  

If you prefer to read system documentation instead of me, see lseek(2) :-)

 The lseek() function allows the file offset to be set beyond the end of
 the existing end-of-file of the file. If data is later written at this
 point, subsequent reads of the data in the gap return bytes of zeros (until
 data is actually written into the gap).

I suppose gap == hole.  Silly semantics. :-) 

> Actually, something like an "exists" function for a record
> wouldn't touch the disk at all!  When writing to a hole, the kernel simply
> allocates the necessary block(s).  This is really fast, too, for creation,
> as the empty set can be written to disk with touch(1), and uses far less
> memory than virtual initialization or memory structures ;-)
> 
> As an example, try 
> 
> fseek(f, 0x7fffffff-1, SEEK_SET);
> fputc('\n', f);
> 
> And analyze the reported filesize, as well as the reported block count of
> the file.  You should see a 2GB "size", and maybe 8K in allocated blocks
> (depends how you ran newfs ;-).  This is behaviour that has been around
> since the original UFS.
> 
> 
> > Take the BTree example again -- if you think about it, all internal nodes
> > in the tree will fit into memory trivially.  It is only the leaf nodes
> > that do not.  So in regards to actual disk accesses you still wind up
> > only doing *ONE* for the btree, even if it winds up being four or five
> > levels deep.
> 
> Right.  And, actually, (without looking a bit more closely), I wouldn't be
> suprised if you could replace the four-line address calculation I have
> with your B+Tree structure and come up with the same result.  Only
> difference would be a few hundred lines of code, much more testing, and
> quite a few megs of RAM...  ;-)
> 
> What you referred to as "nuts", above, is just a logical way to provide a
> huge address space for a set of data, without actually allocating blocks
> in the filesystem for the entire address space until they are used.
> 
> 
> > The result is that you will be able to create an index to your data,
> > which is only 2.8 million records, in around 32 bytes per record

Re: Filesystem holes

2000-10-29 Thread Ryan Thompson

Matt Dillon wrote to Ryan Thompson:

> :> :> storage is rather inefficient for our table of about 2,850,000 members
> :> :> (~2.1 GB total storage).  There are 64M possible hash values in our
> :> :> current implementation, and our record size is variable, but could be
> :> :> safely fixed at about 1.5KB... So, total storage if all values were used
> :> :> would be about 96GB.  (See where I'm going with this?)
> :...
> :
> :Remember, though, I'm not proposing a hash table.  I have been lucky
> :enough to design a "hash" function that will crunch down all possible
> :values into about 64M unique keys--so, no collisions.  I propose to use
> :this function with address calculation.  So, O(1) everything, for single
> :random lookups, which is (virtually) all this data structure must deal
> :with.  Sorting isn't necessary, and that could be accomplished in O(n)
> :anyway.
> 
> Are you still talking about creating a 96GB file to manage 2.1GB worth
> of data?  I gotta say, that sounds kinda nuts to me!  Cpu cycles are
> cheap, I/O and memory is not.

Hi, Matt!  Thanks for the replies.  I'll try and keep you interested ;-)

Hmm... Perhaps you're still missing my original point?  I'm talking about
a file with 96GB in addressable bytes (well, probably a bunch of files,
given logical filesize limitations, but let's say for simplicity's sake
that we have a single file).  Its actual "size" (in terms of allocated
blocks) will be only a bit larger than 2.1GB.  (Directly proportional to
the used size of the dataset.  Discrepancies only come into play when
record size != block size, but that can be worked around somewhat)

In other words, ls -ls will report the "size" as some ridiculously large
number, but will show a much smaller block count.  So, assuming four records
are added to the file on block boundaries, the file will actually only use
four blocks... nowhere near 96GB!

In the UNIX filesystem (ya, I know.. just pick one :-), size of file !=
space allocated for file.  Thus, my original questions were centered
around filesystem holes.  I.e., non-allocated chunks in the middle of a
file.  When trying to READ from within a hole, the kernel just sends back
a buffer of zeros... which is enough to show that the record is not
initialized.  Actually, something like an "exists" function for a record
wouldn't touch the disk at all!  When writing to a hole, the kernel simply
allocates the necessary block(s).  This is really fast, too, for creation,
as the empty set can be written to disk with touch(1), and uses far less
memory than virtual initialization or memory structures ;-)

As an example, try 

fseek(f, 0x7fffffff-1, SEEK_SET);
fputc('\n', f);

And analyze the reported filesize, as well as the reported block count of
the file.  You should see a 2GB "size", and maybe 8K in allocated blocks
(depends how you ran newfs ;-).  This is behaviour that has been around
since the original UFS.
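
A sketch of the store/fetch pair this implies, with the ~1.5KB record
size from the earlier post and the "all zeros means not initialized"
convention; pread/pwrite are just a convenient way to write it:

#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define RECORD_SIZE 1536                /* ~1.5KB fixed-size record */

int store(int fd, uint64_t key, const void *rec)
{
    off_t off = (off_t)key * RECORD_SIZE;       /* the address calculation */
    return pwrite(fd, rec, RECORD_SIZE, off) == RECORD_SIZE ? 0 : -1;
}

/* 1 = record exists, 0 = slot is still a hole (or past EOF), -1 = error. */
int fetch(int fd, uint64_t key, void *rec)
{
    static const char zeros[RECORD_SIZE];
    off_t off = (off_t)key * RECORD_SIZE;
    ssize_t n = pread(fd, rec, RECORD_SIZE, off);

    if (n < 0)
        return -1;
    if (n < RECORD_SIZE)
        return 0;                       /* beyond the last written block */
    return memcmp(rec, zeros, RECORD_SIZE) != 0;
}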


> Take the BTree example again -- if you think about it, all internal nodes
> in the tree will fit into memory trivially.  It is only the leaf nodes
> that do not.  So in regards to actual disk accesses you still wind up
> only doing *ONE* for the btree, even if it winds up being four or five
> levels deep.

Right.  And, actually, (without looking a bit more closely), I wouldn't be
surprised if you could replace the four-line address calculation I have
with your B+Tree structure and come up with the same result.  Only
difference would be a few hundred lines of code, much more testing, and
quite a few megs of RAM...  ;-)

What you referred to as "nuts", above, is just a logical way to provide a
huge address space for a set of data, without actually allocating blocks
in the filesystem for the entire address space until they are used.


> The result is that you will be able to create an index to your data,
> which is only 2.8 million records, in around 32 bytes per record or
> 89 Megabytes.  Not 96GB.  Since you can cache all the internal nodes
> of the btree you are done.  The machine will do a much better job
> caching a 89MB index then a 96GB index.
> 
> Do not make the mistake of equating cpu to I/O.  The cpu required to
> iterate through 4 levels in the btree (assuming 64 entries per node,
> it only takes 4 to cover 16 million records) is *ZERO* ... as in,
> probably less then a microsecond, whereas accessing the disk is going
> to be milliseconds.

CPU time for what I'm proposing is even closer to zero than for a tree...
But, you're right, it doesn't make any real difference when compared to
disk I/O...  B-Trees are good for a lot of things.  Address calculation
can be really good, too, given a finite key 

Re: Filesystem holes

2000-10-29 Thread Matt Dillon

:> :> storage is rather inefficient for our table of about 2,850,000 members
:> :> (~2.1 GB total storage).  There are 64M possible hash values in our
:> :> current implementation, and our record size is variable, but could be
:> :> safely fixed at about 1.5KB... So, total storage if all values were used
:> :> would be about 96GB.  (See where I'm going with this?)
:...
:
:Remember, though, I'm not proposing a hash table.  I have been lucky
:enough to design a "hash" function that will crunch down all possible
:values into about 64M unique keys--so, no collisions.  I propose to use
:this function with address calculation.  So, O(1) everything, for single
:random lookups, which is (virtually) all this data structure must deal
:with.  Sorting isn't necessary, and that could be accomplished in O(n)
:anyway.

Are you still talking about creating a 96GB file to manage 2.1GB worth
of data?  I gotta say, that sounds kinda nuts to me!  CPU cycles are
cheap; I/O and memory are not.

Take the BTree example again -- if you think about it, all internal nodes
in the tree will fit into memory trivially.  It is only the leaf nodes
that do not.  So in regards to actual disk accesses you still wind up
only doing *ONE* for the btree, even if it winds up being four or five
levels deep.

The result is that you will be able to create an index to your data,
which is only 2.8 million records, in around 32 bytes per record or
89 Megabytes.  Not 96GB.  Since you can cache all the internal nodes
of the btree you are done.  The machine will do a much better job
caching an 89MB index than a 96GB index.

Do not make the mistake of equating cpu to I/O.  The cpu required to
iterate through 4 levels in the btree (assuming 64 entries per node,
it only takes 4 to cover 16 million records) is *ZERO* ... as in,
probably less than a microsecond, whereas accessing the disk is going
to be milliseconds.
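
The arithmetic is easy to verify with the figures above (2.8 million
records, 32 bytes of index per record, 64-way nodes):

#include <stdio.h>

int main(void)
{
    long fanout = 64, records = 2800000, per_record = 32;
    long coverage = fanout * fanout * fanout * fanout;      /* 4 levels */

    printf("4 levels of 64-way nodes cover %ld keys (>= 16 million)\n", coverage);
    printf("index: %ld x %ld bytes = %.1f MB, not 96GB\n",
           records, per_record, records * per_record / 1e6);
    return 0;
}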

:> A B+Tree will also scale with the size of the dataset being managed,
:> so you do not have to preallocate or prereserve file space.
:
:So will address calculation + filesystem holes, and sufficiently large
:filesizes :-)

Uh, not with a 50:1 file size factor it won't.

-Matt





Re: Filesystem holes

2000-10-29 Thread Ryan Thompson

Matt Dillon wrote to Ryan Thompson and [EMAIL PROTECTED]:

> :> Hi all...
> :> 
> :> One the tasks that I have undertaken lately is to improve the efficiency
> :> of a couple of storage facilities we use internally, here.  Basically,
> :> they are moderate-size tables currently implemented in SQL, which is OK in
> :> terms of performance, but the hash function is breaking down and the
> :> storage is rather inefficient for our table of about 2,850,000 members
> :> (~2.1 GB total storage).  There are 64M possible hash values in our
> :> current implementation, and our record size is variable, but could be
> :> safely fixed at about 1.5KB... So, total storage if all values were used
> :> would be about 96GB.  (See where I'm going with this?)
> :> 
> :> One of the options I am considering is actually using address calculation,
> :> and taking advantage of filesystem holes, to keep storage down to what is
> :> actually being used, while providing instant lookups.
> :> 
> :> The single file would be about 96G addressable bytes... But the actual
> :> block count would be much lower.  I suppose I will have to create a series
> :> of these files and divide the problem into < 4GB chunks, but one
> :> lookup/store will still be done with one calculation and one disk seek
> :> (just more filehandles open).
> :> 
> :> Deletes seem problematic.  My question is, does the operating system
> :> provide any way to free blocks used in the middle of a file?
> :> 
> :> Must I periodically re-create these files (long and slow process, but not
> :> so bad if done infrequently) to reclaim space, or is there a way to free
> :> arbitrary blocks in a file in userland (WITHOUT trashing the filesystem?
> :> :-)
> :> 
> :> - Ryan
> :> 
> :> -- 
> :>   Ryan Thompson <[EMAIL PROTECTED]>
> :>   Network Administrator, Accounts
> :>   Phone: +1 (306) 664-1161
> :> 
> :>   SaskNow Technologies http://www.sasknow.com
> :>   #106-380 3120 8th St E   Saskatoon, SK  S7H 0W2
> 
> I would strongly recommend using a B+Tree instead of a hash table.  With
> a hash table slightly different lookups will seek all over the place.
> With a B+Tree lookups will stay more localized.

Right... That's a good point, but (and, no, I really didn't mention this
in my original post), "sequential" access is never important with the
system I have described.  Lookups are always random, and they are almost
always done one at a time (multiple lookups only done for data swaps...
very rare, like 1/1e7 accesses... and those are still O(1) (well,
O(2), but who's counting ;-)))...


> For example, if you insert a (nearly) sorted dictionary of words into an
> SQL table with a B+Tree, the memory working set required to insert
> efficiently stays constant whether you are inserting a thousand, a million,
> or a billion records.  That is, the memory requirement is effecitvely
> O(LogN) for a disk requirement of O(N).   With a hash table, the memory
> working set required to insert efficiently is approximately O(N) for a disk
> requirement of O(N)... much much worse.

Remember, though, I'm not proposing a hash table.  I have been lucky
enough to design a "hash" function that will crunch down all possible
values into about 64M unique keys--so, no collisions.  I propose to use
this function with address calculation.  So, O(1) everything, for single
random lookups, which is (virtually) all this data structure must deal
with.  Sorting isn't necessary, and that could be accomplished in O(n)
anyway.


> A B+Tree will also scale with the size of the dataset being managed,
> so you do not have to preallocate or prereserve file space.

So will address calculation + filesystem holes, and sufficiently large
filesizes :-)

#include <stdio.h>

int main()
{
    FILE *f;

    f = fopen("bigfile", "w");
    fseek(f, 0x7fffffff, SEEK_SET);
    putc('\n', f);
    fclose(f);
    return 0;
}

$ cc -o hole hole.c
$ ./hole
$ ls -lsk bigfile
48 -rw-rw-r--  1 ryan  2147483648 Oct 29 14:09 bigfile


> We are using an in-house SQL database for our product (which I can't
> really talk about right now) and, using B+Tree's, I can consistently
> insert 300 records/sec on a cheap desktop PC (which I use for testing)
> with 64MB of ram (remember, insert requires an internal select to check
> for key conflicts), even when the test database grows to several
> gigabytes.

I haven't even begun implementation, and I haven't had a chance to greatly
experiment... So I don't know how this will perform.  It would be directly
de

Re: Filesystem holes

2000-10-29 Thread Matt Dillon

:> Hi all...
:> 
:> One the tasks that I have undertaken lately is to improve the efficiency
:> of a couple of storage facilities we use internally, here.  Basically,
:> they are moderate-size tables currently implemented in SQL, which is OK in
:> terms of performance, but the hash function is breaking down and the
:> storage is rather inefficient for our table of about 2,850,000 members
:> (~2.1 GB total storage).  There are 64M possible hash values in our
:> current implementation, and our record size is variable, but could be
:> safely fixed at about 1.5KB... So, total storage if all values were used
:> would be about 96GB.  (See where I'm going with this?)
:> 
:> One of the options I am considering is actually using address calculation,
:> and taking advantage of filesystem holes, to keep storage down to what is
:> actually being used, while providing instant lookups.
:> 
:> The single file would be about 96G addressable bytes... But the actual
:> block count would be much lower.  I suppose I will have to create a series
:> of these files and divide the problem into < 4GB chunks, but one
:> lookup/store will still be done with one calculation and one disk seek
:> (just more filehandles open).
:> 
:> Deletes seem problematic.  My question is, does the operating system
:> provide any way to free blocks used in the middle of a file?
:> 
:> Must I periodically re-create these files (long and slow process, but not
:> so bad if done infrequently) to reclaim space, or is there a way to free
:> arbitrary blocks in a file in userland (WITHOUT trashing the filesystem?
:> :-)
:> 
:> - Ryan
:> 
:> -- 
:>   Ryan Thompson <[EMAIL PROTECTED]>
:>   Network Administrator, Accounts
:>   Phone: +1 (306) 664-1161
:> 
:>   SaskNow Technologies http://www.sasknow.com
:>   #106-380 3120 8th St E   Saskatoon, SK  S7H 0W2

I would strongly recommend using a B+Tree instead of a hash table.  With
a hash table slightly different lookups will seek all over the place.
With a B+Tree lookups will stay more localized.

For example, if you insert a (nearly) sorted dictionary of words into an
SQL table with a B+Tree, the memory working set required to insert
efficiently stays constant whether you are inserting a thousand, a million,
or a billion records.  That is, the memory requirement is effectively
O(LogN) for a disk requirement of O(N).   With a hash table, the memory
working set required to insert efficiently is approximately O(N) for a disk
requirement of O(N)... much much worse.

A B+Tree will also scale with the size of the dataset being managed,
so you do not have to preallocate or prereserve file space.

We are using an in-house SQL database for our product (which I can't
really talk about right now) and, using B+Tree's, I can consistently
insert 300 records/sec on a cheap desktop PC (which I use for testing)
with 64MB of ram (remember, insert requires an internal select to check
for key conflicts), even when the test database grows to several
gigabytes.

In anycase, a B+Tree is the way to go.  Hash tables are useful only in
very, very, very specialized circumstances.  In all other circumstances
they are no better than a pain in the rear.

-Matt






Re: Filesystem holes

2000-10-29 Thread Wes Peters

Luigi Rizzo wrote:
> 
> Hi,
> 
> how about using an indirect table of 64M 32-bit pointers into
> the actual blocks being used ? For insertions you just
> allocate a new fixed size block from the file.

Isn't this roughly what dbm does with the hash key method?

-- 
"Where am I, and what am I doing in this handbasket?"

Wes Peters Softweyr LLC
[EMAIL PROTECTED]   http://softweyr.com/





Re: Filesystem holes

2000-10-28 Thread Luigi Rizzo

Hi,

how about using an indirect table of 64M 32-bit pointers into
the actual blocks being used? For insertions you just
allocate a new fixed-size block from the file. For deletion,
either keep a list of freed blocks, or provide a back pointer
from each entry into the hash table so when you remove a block
(creating a hole) you can move the last allocated one into the
hole and update the hash table.

Kind of quick. You might have to pay an extra disk access on each
access but given enough memory you'd save that one as well.
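
A sketch of that scheme: a table of 32-bit block numbers indexed by the
64M-value hash, fixed-size blocks appended to a data file, and delete
done by moving the last block into the hole.  The block layout (a back
pointer in each block) and the sizes are assumptions for illustration.

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define NKEYS     (64u * 1024 * 1024)   /* 64M possible hash values */
#define BLOCKSIZE 1536
#define NOBLOCK   0xffffffffu

struct blk {
    uint32_t backptr;                   /* which hash slot points at this block */
    char     data[BLOCKSIZE - sizeof(uint32_t)];
};

/* In practice this table would itself live in a file or be mmap'd;
 * every entry starts out as NOBLOCK. */
static uint32_t table[NKEYS];
static uint32_t nblocks;                /* blocks currently in the data file */

void delete_record(int fd, uint32_t key)
{
    uint32_t victim = table[key];
    if (victim == NOBLOCK)
        return;

    uint32_t last = --nblocks;          /* number of the last allocated block */
    if (victim != last) {
        struct blk b;
        /* Move the last block into the hole... */
        (void)pread(fd, &b, sizeof b, (off_t)last * sizeof b);
        (void)pwrite(fd, &b, sizeof b, (off_t)victim * sizeof b);
        table[b.backptr] = victim;      /* ...and repoint its hash-table entry. */
    }
    table[key] = NOBLOCK;
    (void)ftruncate(fd, (off_t)nblocks * sizeof(struct blk));  /* return the tail */
}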

cheers
luigi

--+-
 Luigi RIZZO, [EMAIL PROTECTED]  . ACIRI/ICSI (on leave from Univ. di Pisa)
 http://www.iet.unipi.it/~luigi/  . 1947 Center St, Berkeley CA 94704
 Phone: (510) 666 2927
--+-

On Sat, 28 Oct 2000, Ryan Thompson wrote:

> 
> Hi all...
> 
> One the tasks that I have undertaken lately is to improve the efficiency
> of a couple of storage facilities we use internally, here.  Basically,
> they are moderate-size tables currently implemented in SQL, which is OK in
> terms of performance, but the hash function is breaking down and the
> storage is rather inefficient for our table of about 2,850,000 members
> (~2.1 GB total storage).  There are 64M possible hash values in our
> current implementation, and our record size is variable, but could be
> safely fixed at about 1.5KB... So, total storage if all values were used
> would be about 96GB.  (See where I'm going with this?)
> 
> One of the options I am considering is actually using address calculation,
> and taking advantage of filesystem holes, to keep storage down to what is
> actually being used, while providing instant lookups.
> 
> The single file would be about 96G addressable bytes... But the actual
> block count would be much lower.  I suppose I will have to create a series
> of these files and divide the problem into < 4GB chunks, but one
> lookup/store will still be done with one calculation and one disk seek
> (just more filehandles open).
> 
> Deletes seem problematic.  My question is, does the operating system
> provide any way to free blocks used in the middle of a file?
> 
> Must I periodically re-create these files (long and slow process, but not
> so bad if done infrequently) to reclaim space, or is there a way to free
> arbitrary blocks in a file in userland (WITHOUT trashing the filesystem?
> :-)
> 
> - Ryan
> 
> -- 
>   Ryan Thompson <[EMAIL PROTECTED]>
>   Network Administrator, Accounts
>   Phone: +1 (306) 664-1161
> 
>   SaskNow Technologies http://www.sasknow.com
>   #106-380 3120 8th St E   Saskatoon, SK  S7H 0W2
> 






Filesystem holes

2000-10-28 Thread Ryan Thompson


Hi all...

One of the tasks that I have undertaken lately is to improve the efficiency
of a couple of storage facilities we use internally, here.  Basically,
they are moderate-size tables currently implemented in SQL, which is OK in
terms of performance, but the hash function is breaking down and the
storage is rather inefficient for our table of about 2,850,000 members
(~2.1 GB total storage).  There are 64M possible hash values in our
current implementation, and our record size is variable, but could be
safely fixed at about 1.5KB... So, total storage if all values were used
would be about 96GB.  (See where I'm going with this?)

One of the options I am considering is actually using address calculation,
and taking advantage of filesystem holes, to keep storage down to what is
actually being used, while providing instant lookups.

The single file would be about 96G addressable bytes... But the actual
block count would be much lower.  I suppose I will have to create a series
of these files and divide the problem into < 4GB chunks, but one
lookup/store will still be done with one calculation and one disk seek
(just more filehandles open).

Deletes seem problematic.  My question is, does the operating system
provide any way to free blocks used in the middle of a file?

Must I periodically re-create these files (long and slow process, but not
so bad if done infrequently) to reclaim space, or is there a way to free
arbitrary blocks in a file in userland (WITHOUT trashing the filesystem?
:-)

- Ryan

-- 
  Ryan Thompson <[EMAIL PROTECTED]>
  Network Administrator, Accounts
  Phone: +1 (306) 664-1161

  SaskNow Technologies http://www.sasknow.com
  #106-380 3120 8th St E   Saskatoon, SK  S7H 0W2


