Re: Filesystem holes
> > :> actually being used, while providing instant lookups. > > :> > > :> The single file would be about 96G addressable bytes... But the actual > > :> block count would be much lower. I suppose I will have to create a series > > :> of these files and divide the problem into < 4GB chunks, but one > > :> lookup/store will still be done with one calculation and one disk seek > > :> (just more filehandles open). > > :> > > :> Deletes seem problematic. My question is, does the operating system > > :> provide any way to free blocks used in the middle of a file? > > :> > > :> Must I periodically re-create these files (long and slow process, but not > > :> so bad if done infrequently) to reclaim space, or is there a way to free > > :> arbitrary blocks in a file in userland (WITHOUT trashing the filesystem? > > :> :-) > > :> > > :> - Ryan > > :> > > :> -- > > :> Ryan Thompson <[EMAIL PROTECTED]> > > :> Network Administrator, Accounts > > :> Phone: +1 (306) 664-1161 > > :> > > :> SaskNow Technologies http://www.sasknow.com > > :> #106-380 3120 8th St E Saskatoon, SK S7H 0W2 > > > > I would strongly recommend using a B+Tree instead of a hash table. With > > a hash table slightly different lookups will seek all over the place. > > With a B+Tree lookups will stay more localized. > > Right... That's a good point, but (and, no, I really didn't mention this > in my original post), "sequential" access is never important with the > system I have described. Lookups are always random, and they are almost > always done one at a time (multiple lookups only done for data swaps... > very rare, like 1/1e7 accesses... and those are still O(1) (well, > O(2), but who's counting ;-)))... > Remember, though, I'm not proposing a hash table. I have been lucky > enough to design a "hash" function that will crunch down all possible > values into about 64M unique keys--so, no collisions. I propose to use > this function with address calculation. So, O(1) everything, for single > random lookups, which is (virtually) all this data structure must deal > with. Sorting isn't necessary, and that could be accomplished in O(n) > anyway. > A way to use a hash for variable-length records: Use an intermediate "file" whose records are simply pointers to the location of the real data. Finding a record goes like this: ACnumber=getRecord("KEY"); realRecordNumber=AC(ACnumber); then read record number realRecordNumber. AC I pinched from Adabas, which (back in the early 80s, dunno what it does now) had an address converter. It changed record numbers (returned by getRecord() here) to a disk address. Its size is n, where n is the number of records you support. In data storage (another Adabas term), you have fixed-length blocks, each containing several records. Adabas stored the Adabas record number and the data fields; numbers stored as binary, character fields (usually) preceded with a (one-byte) record size so trailing spaces could be dropped. Inserted records go into empty blocks (except in load mode, where blocks are filled to a user-specified percentage) (and presumably into a block in the buffer pool with space). Blocks are compressed before being rewritten - all the data is packed to the front. The first field in the DS block is the offset to free space in the block. Adabas also had a work dataset, where data are recorded pending end of transaction; EOT can be acknowledged when the information is safely recorded in work, as Adabas can always recover from there.
It also has transaction logging, either to tape (alternating between two drives) or to two files on disk. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
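For concreteness, a minimal sketch of the indirection scheme described above: a fixed-size table of pointers (here, file offsets) indexed by record number, with the variable-length data stored elsewhere. The function and file names (ac_lookup(), table.ac, table.data) are illustrative, not from Adabas or any real implementation.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define AC_ENTRIES  (64 * 1024 * 1024)   /* one slot per possible record number */

/* Read the data-store offset for a given record number from the AC file. */
static off_t
ac_lookup(int ac_fd, unsigned recno)
{
    off_t data_off;

    if (recno >= AC_ENTRIES)
        return (-1);
    if (pread(ac_fd, &data_off, sizeof(data_off),
              (off_t)recno * sizeof(data_off)) != sizeof(data_off))
        return (-1);
    return (data_off);           /* 0 could be used to mean "no such record" */
}

int
main(void)
{
    int ac_fd = open("table.ac", O_RDONLY);      /* hypothetical file names */
    int ds_fd = open("table.data", O_RDONLY);
    char rec[1536];
    unsigned recno = 12345;                      /* e.g. from getRecord("KEY") */
    off_t off;

    if (ac_fd < 0 || ds_fd < 0)
        return (1);
    off = ac_lookup(ac_fd, recno);
    if (off > 0 && pread(ds_fd, rec, sizeof(rec), off) > 0)
        printf("record %u found at offset %lld\n", recno, (long long)off);
    close(ac_fd);
    close(ds_fd);
    return (0);
}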
Re: Filesystem holes
:In my case I'd be better off with shared memory objects that aren't :persistent but appear in the name space so that I don't accidentally :start copying a virtual bus file when the programs exit improperly. :In the sparse matrix calculations with no checkpointing or need to appear :in a name space I'd think the best thing would be to use VM with the matrix :initially mapped to a copy on write zero page. I guess you can't :do that without mmap because of swap allocation. : :Peter : :-- :Peter Dufault ([EMAIL PROTECTED]) Realtime development, Machine control, If you can fit the whole thing into a process's VM space, then you can certainly use mmap(...MAP_ANON...) combined with madvise(... MADV_FREE) to implement a sparse memory object. There was talk a while ago about feeding MADV_FREE through to the filesystem layer. I was under the impression that SUN does that, does anyone know? -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
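A minimal sketch of the mmap(MAP_ANON) + madvise(MADV_FREE) combination Matt mentions: a large anonymous mapping consumes memory only for pages that are actually touched, and MADV_FREE tells the VM system it may reclaim a region that is no longer needed. The sizes and offsets are illustrative.

#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    size_t len = (size_t)1 << 30;        /* 1GB of sparse address space */
    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_ANON | MAP_PRIVATE, -1, 0);

    if (base == MAP_FAILED) {
        perror("mmap");
        return (1);
    }
    /* Touch one page deep inside the mapping; only now is it backed. */
    memset(base + (100 << 20), 0xff, 4096);

    /* "Free" that region again; the pages become reclaimable. */
    if (madvise(base + (100 << 20), 4096, MADV_FREE) == -1)
        perror("madvise");

    munmap(base, len);
    return (0);
}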
Re: Filesystem holes
:to hold a write lock on the range you didn't want rewritten; so :long as it honors the advisory locks, there'd be no chance of it :screwing up, unless you got bit by the stupid POSIX lock close :semantics. Stupid POSIX; that's the other one I'd put in: the :ability to: : : int i = 1; : : fcntl( fd, F_NONPOSIX, &i); : :It would help out the NFS locking daemon to no end... : : Terry Lambert : [EMAIL PROTECTED] We could implement this trivially, and I'm pretty sure we could implement some sort of free-space semantics trivially too, at least for UFS, using a struct flock to pass the parameters to a fcntl. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Filesystem holes
> > I use them for bus simulations, which also are permanently sparse. > > It would be nice to free up the regions when I "remove" a virtual > > board, but in a check through POSIX I could find nothing defined to > > behave that way either for mapped files or mapped memory objects. > (...) > > POSIX was unwilling to take a stand on the F_FREESP vs. ftruncate > war, and so they remained agnostic, and didn't document anything. No, IIRC POSIX defines ftruncate behavior both for mapped files and shared memory objects, but that isn't much use for freeing up holes unless you want to repopulate chunks after the hole. All in all I agree with Matt about the suitability of large sparse files for data storage. My case is different in that I want to test object code in pretty much its final form, and legacy code will be full of brute-force address calculations. ... > Peter: for your rewriting problem, I think you could just decide > to hold a write lock on the range you didn't want rewritten; so > long as it honors the advisory locks, there'd be no chance of it > screwing up, unless you got bit by the stupid POSIX lock close > semantics. Stupid POSIX; that's the other one I'd put in: the > ability to: I never thought of that. I'll look at it. I'll have to see how it is defined in POSIX and how it plays with shared memory objects on Solaris - currently I have no differences in the code other than defining shm_open to be (errno = ENOSYS, -1) if shared memory objects aren't supported, and I fall back to mapped files and it all works well. One unfortunate thing, though, is that Solaris requires that shared memory objects have the name of a file in "/", where I typically don't want to place multi-gigabyte files even when they are allegedly sparse, requiring placing shared memory objects and shared files in different places and also having names such as "/vme_pid7662_data64" since I can't have subdirs. Peter -- Peter Dufault ([EMAIL PROTECTED]) Realtime development, Machine control, HD Associates, Inc. Fail-Safe systems, Agency approval To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
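For reference, the range-lock idea Peter says he will look at amounts to holding an fcntl(F_SETLK) write lock over the bytes that must not be rewritten; a cooperating process that checks the advisory lock before writing will be refused. A sketch, with a hypothetical file name and illustrative offsets:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("vme_bus_image", O_RDWR);   /* hypothetical file name */
    struct flock fl;

    if (fd < 0)
        return (1);
    fl.l_type   = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0x10000000;     /* start of the "removed" board's region */
    fl.l_len    = 0x01000000;     /* length of the region to protect */

    if (fcntl(fd, F_SETLK, &fl) == -1)
        perror("fcntl(F_SETLK)");

    /*
     * Keep fd open; any close of the file by this process drops the lock
     * (the POSIX close semantics complained about above).
     */
    pause();
    return (0);
}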
Re: Filesystem holes
> I use them for bus simulations, which also are permanently sparse. > It would be nice to free up the regions when I "remove" a virtual > board, but in a check through POSIX I could find nothing defined to > behave that way either for mapped files or mapped memory objects. SVR4 defined F_FREESP; if passed to fcntl, it will call the VOP_SPACE routine. The third argument is a struct flock, and it used to be that it only supported l_len == 0, but that changed in 1993/1994, about the same time we got sfork support flogged into it past the USL Process (caps intentional). 1993 was too late to get either interface fully documented in "The Magic Garden Explained", unfortunately, but it's been in SVR4 (along with sfork, via an ioctl against /proc), since back then. POSIX was unwilling to take a stand on the F_FREESP vs. ftruncate war, and so they remained agnostic, and didn't document anything. FWIW, the SVR4 interface is better, since it allows you to open holes. It was surprisingly hard to get simple changes like this past the keepers of the keys to the USL source tree. After we found out why, it became more understandable: the USL source code is built like a trinary nerve gas weapon, and you have to touch all three parts to get a change into the code. It's really rather arranged So That Things Will Not Change. My preference would be to hook it into fcntl, with F_FREESP, with the extended interface that can do more than truncate. This would require adding a "VOP_SPACE" type thing. There was also an undocumented F_ALLOCSP; I don't remember if that got folded in with the rest of the code, or if it got left out, but it does the obvious thing: allocated backing pages. Peter: for your rewriting problem, I think you could just decide to hold a write lock on the range you didn't want rewritten; so long as it honors the advisory locks, there'd be no chance of it screwing up, unless you got bit by the stupid POSIX lock close semantics. Stupid POSIX; that's the other one I'd put in: the ability to: int i = 1; fcntl( fd, F_NONPOSIX, &i); It would help out the NFS locking daemon to no end... Terry Lambert [EMAIL PROTECTED] --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
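As a concrete illustration of the SVR4/Solaris interface Terry describes, F_FREESP is passed to fcntl() with a struct flock selecting the byte range to deallocate (l_len == 0 historically meant "through end of file"). This is only a sketch for a system that actually provides F_FREESP; FreeBSD, as discussed in this thread, does not.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("bigfile", O_RDWR);
    struct flock fl;

    if (fd < 0)
        return (1);
    fl.l_whence = SEEK_SET;
    fl.l_start  = 8192;           /* first byte of the hole to punch */
    fl.l_len    = 65536;          /* length; 0 would mean "through EOF" */

#ifdef F_FREESP
    if (fcntl(fd, F_FREESP, &fl) == -1)
        perror("fcntl(F_FREESP)");
#else
    (void)fl;                     /* no F_FREESP on this system */
#endif
    close(fd);
    return (0);
}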
Re: Filesystem holes
> ... Sparse matrixes are the big math problem > that benefit, but only because the solution to a sparse matrix problem > is not even close to random so the sparse matrix winds up still being > sparse all the way to the end of the solution. I use them for bus simulations, which also are permanently sparse. It would be nice to free up the regions when I "remove" a virtual board, but in a check through POSIX I could find nothing defined to behave that way either for mapped files or mapped memory objects. Also, a write from any process would repopulate the region, which I really wouldn't want, but I don't see that level of control over mapping between unrelated processes (Now I start thinking about MAP_FIXED to a specified virtual address and implementing funny tricks, but I don't have time to work on that). In my case I'd be better off with shared memory objects that aren't persistent but appear in the name space so that I don't accidentally start copying a virtual bus file when the programs exit improperly. In the sparse matrix calculations with no checkpointing or need to appear in a name space I'd think the best thing would be to use VM with the matrix initially mapped to a copy-on-write zero page. I guess you can't do that without mmap because of swap allocation. Peter -- Peter Dufault ([EMAIL PROTECTED]) Realtime development, Machine control, HD Associates, Inc. Fail-Safe systems, Agency approval To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
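A rough sketch of the sparse bus image described above: the whole simulated bus address space is one mapped file, and only the regions behind "installed" boards are ever written, so only those regions get disk blocks. The bus size, board offset, and file name are illustrative.

#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BUS_SIZE    ((size_t)1 << 30)     /* 1GB simulated bus address space */
#define BOARD_BASE  0x20000000UL          /* where one virtual board lives */

int
main(void)
{
    int fd = open("vme_bus_image", O_RDWR | O_CREAT, 0644);
    unsigned char *bus;

    if (fd < 0 || ftruncate(fd, (off_t)BUS_SIZE) == -1)
        return (1);
    bus = mmap(NULL, BUS_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bus == MAP_FAILED) {
        perror("mmap");
        return (1);
    }
    /* Only the page(s) actually written ever get blocks in the file. */
    bus[BOARD_BASE] = 0x42;

    munmap(bus, BUS_SIZE);
    close(fd);
    return (0);
}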
Re: Filesystem holes
: :> Ahh.. yes, I know. I'm a filesystem expert :-) However, that said, I :> will tell you quite frankly that virtually *nobody* depends on holes :> for efficient storage. There are only a few problems where it's :> practical some forms of executables, and sparse matrixes. That's :> pretty much it. : :Your master password database. Most sendmail maps. Anything :else that uses the Berkeley DB, like message catalog files, :locales, etc.. Actually less than you think. Even though DB utilizes the concept of a sparse file, if you check carefully you will note that your sendmail maps and password file databases aren't actually all that sparse. With an 8K filesystem block size the default DB multiplier usually results in virtually no sparseness at all. It takes tuning to get any sort of sparseness, and even then you don't get much benefit from it. The only real benefit is in the hash table size factor ... the hash array may be sparse, but the physical file underlying it probably isn't. Not even USENET news history files, which can run into the gigabytes, actually wind up being that sparse. Also keep in mind that Berkeley DB is very old, even with all the rewrites, and the original hash algorithm was chosen for expediency rather than for the 'best' efficiency. Our current DB library has a btree access method, and for big data sets it works a whole lot better than the hash method. It doesn't require tuning, for one. :Frankly, sparse files have a huge number of uses, particularly :when applied to persistent storage of data of the kind you'd :see in chapter 5, section 5.4.x and chapter 6 in Knuth's. : :Plus your FFS filesystem itself is a sparse matrix. It'd be :real useful to be able to "free up holes" in a file, if I :wanted to use one to do user space work on an FS design, for :example, a log structured FS, where I wanted to be able to :experiment with a "cleaner" process that recovered extents. : :I'd actually be able to tell real quickly whether it was :working by just setting an allocation range that I expect :my iterative testing to stay within (if it goes over or under :the range while I'm moving stuff around and cleaning at the :same time, I'll know there's a bug in my daemon). : :Personally, I'm not rich enough to be able to burn disk space :so easily. : Terry Lambert : [EMAIL PROTECTED] I agree that sparse files should not be discarded out of hand. There are very few real problems that need them, though. I can think of a handful, all very specialized: for example, the VN device uses the concept of a sparse-file (actually a sparse backing for the filesystem layer), as do Kirk's softupdates snapshots (I believe). Sparse matrixes are the big math problem that benefit, but only because the solution to a sparse matrix problem is not even close to random, so the sparse matrix winds up still being sparse all the way to the end of the solution. But these are extremely specialized problems, not general datastore problems, and nearly all of these problems are tuned (or inherently) specific to the block size of the underlying system. Using a sparse file for general data store just isn't all that hot an idea, because by its very nature data store is, well, storing data. Packing it is usually the way to go. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Filesystem holes
> Ahh.. yes, I know. I'm a filesystem expert :-) However, that said, I > will tell you quite frankly that virtually *nobody* depends on holes > for efficient storage. There are only a few problems where it's > practical some forms of executables, and sparse matrixes. That's > pretty much it. Your master password database. Most sendmail maps. Anything else that uses the Berkeley DB, like message catalog files, locales, etc.. Frankly, sparse files have a huge number of uses, particularly when applied to persistent storage of data of the kind you'd see in chapter 5, section 5.4.x and chapter 6 in Knuth's. Plus your FFS filesystem itself is a sparse matrix. It'd be real useful to be able to "free up holes" in a file, if I wanted to use one to do user space work on an FS design, for example, a log structured FS, where I wanted to be able to experiment with a "cleaner" process that recovered extents. I'd actually be able to tell real quickly whether it was working by just setting an allocation range that I expect my iterative testing to stay within (if it goes over or under the range while I'm moving stuff around and cleaning at the same time, I'll know there's a bug in my daemon). Personally, I'm not rich enough to be able to burn disk space so easily. Terry Lambert [EMAIL PROTECTED] --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Filesystem holes
:The original question still stands, and I'm quite interested in hearing :an answer. : :I think Ryan's looking for an equivalent to Solaris' F_FREESP fcntl; I'm :not aware that one exists in FBSD - right? : :jan : :-- :jan grant, ILRT, University of Bristol. http://www.ilrt.bris.ac.uk/ No, we don't have anything like that, though our VFS layer could support it. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Filesystem holes
The original question still stands, and I'm quite interested in hearing an answer. I think Ryan's looking for an equivalent to Solaris' F_FREESP fcntl; I'm not aware that one exists in FBSD - right? jan -- jan grant, ILRT, University of Bristol. http://www.ilrt.bris.ac.uk/ Tel +44(0)117 9287163 Fax +44 (0)117 9287112 RFC822 [EMAIL PROTECTED] stty intr ^m To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Filesystem holes
:Yes, there's a high probability of that. It's one of the reasons :why people typically use the feature, at least not for 'permanent' :data sets. Ahhh... of course I meant 'typically do not use the feature'. Heh. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Filesystem holes
Leif Neland wrote to Ryan Thompson and Matt Dillon: > > What will happen, if somebody (possibly you, as mahordomo says), tries to > make a backup of that file. Make sure to use a program that can cope ;-) > Will the copy also be with holes, or would that file suddenly use all 96GB? > It will at least do so, if one does cat file>file.bak > Probably tar will do the same. Actually, tar will handle holes elegantly (this I have tested), with the --sparse option. Older versions would not. cat and other similar "filters" are naive, as they simply block I/O. Backing up with tar and/or a filesystem dump would be just as effective as with any other storage strategy. cat file > file.bak on even a 2GB file is probably not something that would be popular, anyway. > I'd be afraid to create something which could easily blow up by having > normal operations applied to it. That's a valid concern. That's the biggest drawback I see to the overall strategy... But, specialized problems sometimes encourage specialized solutions. > > Leif > -- Ryan Thompson <[EMAIL PROTECTED]> Network Administrator, Accounts Phone: +1 (306) 664-1161 SaskNow Technologies http://www.sasknow.com #106-380 3120 8th St E Saskatoon, SK S7H 0W2 To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Filesystem holes
: :Presumably writing into these holes repeatedly is a none-too-efficient :affair (requiring moving all the stuff on either side), or am I missing the :point slightly? : :Dave :) It isn't quite that bad, but it isn't a walk in the park either. Since data blocks are referenced from a block table, inserting new blocks is cheap. However, this means that data may not be physically ordered in the file -- if your read crosses a block boundary it may require an extra seek. FreeBSD's FFS does have the reallocblks feature turned on, and this will cause the filesystem to try to reorder nearby blocks, but it was never designed to handle sparse files so -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Filesystem holes
:> :What will happen, if somebody (possibly you, as mahordomo says), tries to :make a backup of that file. That will depend on the backup program. dump/restore can handle holes just fine. tar can handle them in a 'fake' way, and you have to tell it. Programs like 'cp' cannot handle holes; they'll copy the zeros. :I'd be afraid to create something which could easily blow up by having :normal operations applied to it. : :Leif Yes, there's a high probability of that. It's one of the reasons why people typically use the feature, at least not for 'permanent' data sets. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
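To make the "cp copies the zeros" point concrete, here is roughly what a hole-aware copy has to do: blocks that read back entirely zero are skipped with lseek() on the output so they stay unallocated. This is only an approximation of what dump or GNU tar do -- a block of genuine zero data also becomes a hole in the copy.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLK 8192

int
main(int argc, char **argv)
{
    char buf[BLK], zero[BLK];
    off_t total;
    ssize_t n;
    int in, out;

    if (argc != 3)
        return (1);
    in  = open(argv[1], O_RDONLY);
    out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0)
        return (1);
    memset(zero, 0, sizeof(zero));

    while ((n = read(in, buf, sizeof(buf))) > 0) {
        if (n == BLK && memcmp(buf, zero, BLK) == 0)
            lseek(out, BLK, SEEK_CUR);      /* skip: leave a hole */
        else if (write(out, buf, n) != n)
            return (1);
    }
    total = lseek(in, 0, SEEK_CUR);         /* bytes read == source size */
    ftruncate(out, total);                  /* preserve size if it ends in a hole */
    close(in);
    close(out);
    return (0);
}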
Re: Filesystem holes
Matt Dillon wrote to Ryan Thompson: > :Hi, Matt! Thanks for the replies. I'll try and keep you interested ;-) > : > :Hmm... Perhaps you're still missing my original point? I'm talking about > :a file with 96GB in addressable bytes (well, probably a bunch of files, > :given logical filesize limitations, but let's say for simplicity's sake > :that we have a single file). It's actual "size" (in terms of allocated > :blocks) will be only a bit larger than 2.1GB. (Directly proportional to > :the used size of the dataset. Discrepancies only come into play when > :record size != block size, but that can be worked around somewhat) > > Ah, ok... that's better, though I think you will find yourself > tuning it endlessly if the blocksize does not match-up. Remember, > the filesystem allocates 8K per block, so if your record size is > 1500 bytes and you have a random distribution, you are going to > wind up eating 8-14GB. True.. As they say, "Disk is cheap", though... And "tuning" isn't so bad on a simple address calculation. I agree that if the record size doesn't closely match the blocksize, there will be a waste of space, but at least that waste is proportional to the dataset... Better than many data structures that I could call by name. It wouldn't be that difficult with many problem sets to optimize them to the point where their records align neatly with the blocksize. Granted, the waste space would hang you with some problem sets. I propose, though, that for some problems, it is easy enough to tune them to reduce (or, if you're really lucky, eliminate), waste space. Yes, this is a specialized solution. > :In other words, ls -ls will report the "size" as some ridiculously large > :number, will show a much smaller block count. So, assuming four records > :are added to the file on block boundaries, the file will actually only use > :four blocks... nowhere near 96GB! > : > :In the UNIX filesystem (ya, I know.. just pick one :-), size of file != > :space allocated for file. Thus, my original questions were centered > :around filesystem holes. I.e., non-allocated chunks in the middle of a > :file. When trying to READ from within a hole, the kernel just sends back > :a buffer of zeros... which is enough to show that the record is not > :initialized. Actually, something like an "exists" function for a record > :wouldn't touch the disk at all! When writing to a hole, the kernel simply > :allocates the necessary block(s). This is really fast, too, for creation, > :as the empty set can be written to disk with touch(1), and uses far less > :memory than virtual initialization or memory structures ;-) > : > :As an example, try > > Ahh.. yes, I know. I'm a filesystem expert :-) I know, Matt, and that's why it's great to talk to you about this :-) I guess I needed an example to convey my point, though. I apologize if it came across as an insult to your intelligence; 'twas not intended that way in the least. > However, that said, I > will tell you quite frankly that virtually *nobody* depends on holes > for efficient storage. Ahh. Ok. This is the kind of response I was looking for. Case studies :-) > There are only a few problems where it's > practical some forms of executables, and sparse matrixes. That's > pretty much it. > :And analyze the reported filesize, as well as the reported block count of > :the file. You should see a 2GB "size", and maybe 8K in allocated blocks > :(depends how you ran newfs ;-). This is behaviour that has been around > :since the original UFS. 
> > It's not a good idea to use a small block size in UFS. The minimum > is pretty much 1K frag, 8K block. for a sparse file this means 8K > is the minimum that will be allocated (frags are only allocated for > the end of a file). Right, which is why this strategy _does_ only work for a specific subset of problems, just as a directed graph works well for a path structure, but would really suck for (among other things) maintaining a sorted list of account numbers. > If you think the btree idea is too complex to implement quickly, then > try using a radix tree (essentially what you said in regards to > breaking the hash you calculate into a node traversal). For your > problem I would recommend 64 elements per node, which corresponds to > 6 bits. 16 million records would fit in 6x4 = 4 levels. If you > cache the first three levels in memory you eat 64^3 = 262144 x > sizeof(record). Assuming a simple file offset for the record, you eat > exactly 1MB of memory, 64MB for the file index, and yo
Re: Filesystem holes
> Hmm... Perhaps you're still missing my original point? I'm talking about > a file with 96GB in addressable bytes (well, probably a bunch of files, > given logical filesize limitations, but let's say for simplicity's sake > that we have a single file). It's actual "size" (in terms of allocated > blocks) will be only a bit larger than 2.1GB. (Directly proportional to > the used size of the dataset. Discrepancies only come into play when > record size != block size, but that can be worked around somewhat) > > In other words, ls -ls will report the "size" as some ridiculously large > number, will show a much smaller block count. So, assuming four records > are added to the file on block boundaries, the file will actually only use > four blocks... nowhere near 96GB! > > In the UNIX filesystem (ya, I know.. just pick one :-), size of file != > space allocated for file. Thus, my original questions were centered > around filesystem holes. I.e., non-allocated chunks in the middle of a > file. When trying to READ from within a hole, the kernel just sends back > a buffer of zeros... which is enough to show that the record is not > initialized. Actually, something like an "exists" function for a record > wouldn't touch the disk at all! When writing to a hole, the kernel simply > allocates the necessary block(s). This is really fast, too, for creation, > as the empty set can be written to disk with touch(1), and uses far less > memory than virtual initialization or memory structures ;-) > What will happen, if somebody (possibly you, as mahordomo says), tries to make a backup of that file. Will the copy also be with holes, or would that file suddenly use all 96GB? It will at least do so, if one does cat file>file.bak Probably tar will do the same. I'd be afraid to create something which could easily blow up by having normal operations applied to it. Leif To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Filesystem holes
At 19:16 29/10/00 -0800, you wrote: > Ahh.. yes, I know. I'm a filesystem expert :-) However, that said, I > will tell you quite frankly that virtually *nobody* depends on holes > for efficient storage. There are only a few problems where it's > practical some forms of executables, and sparse matrixes. That's > pretty much it. Presumably writing into these holes repeatedly is a none-too-efficient affair (requiring moving all the stuff on either side), or am I missing the point slightly? Dave :) To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Filesystem holes
:Hi, Matt! Thanks for the replies. I'll try and keep you interested ;-) : :Hmm... Perhaps you're still missing my original point? I'm talking about :a file with 96GB in addressable bytes (well, probably a bunch of files, :given logical filesize limitations, but let's say for simplicity's sake :that we have a single file). It's actual "size" (in terms of allocated :blocks) will be only a bit larger than 2.1GB. (Directly proportional to :the used size of the dataset. Discrepancies only come into play when :record size != block size, but that can be worked around somewhat) Ah, ok... that's better, though I think you will find yourself tuning it endlessly if the blocksize does not match-up. Remember, the filesystem allocates 8K per block, so if your record size is 1500 bytes and you have a random distribution, you are going to wind up eating 8-14GB. :In other words, ls -ls will report the "size" as some ridiculously large :number, will show a much smaller block count. So, assuming four records :are added to the file on block boundaries, the file will actually only use :four blocks... nowhere near 96GB! : :In the UNIX filesystem (ya, I know.. just pick one :-), size of file != :space allocated for file. Thus, my original questions were centered :around filesystem holes. I.e., non-allocated chunks in the middle of a :file. When trying to READ from within a hole, the kernel just sends back :a buffer of zeros... which is enough to show that the record is not :initialized. Actually, something like an "exists" function for a record :wouldn't touch the disk at all! When writing to a hole, the kernel simply :allocates the necessary block(s). This is really fast, too, for creation, :as the empty set can be written to disk with touch(1), and uses far less :memory than virtual initialization or memory structures ;-) : :As an example, try Ahh.. yes, I know. I'm a filesystem expert :-) However, that said, I will tell you quite frankly that virtually *nobody* depends on holes for efficient storage. There are only a few problems where it's practical some forms of executables, and sparse matrixes. That's pretty much it. :And analyze the reported filesize, as well as the reported block count of :the file. You should see a 2GB "size", and maybe 8K in allocated blocks :(depends how you ran newfs ;-). This is behaviour that has been around :since the original UFS. It's not a good idea to use a small block size in UFS. The minimum is pretty much 1K frag, 8K block. for a sparse file this means 8K is the minimum that will be allocated (frags are only allocated for the end of a file). If you think the btree idea is too complex to implement quickly, then try using a radix tree (essentially what you said in regards to breaking the hash you calculate into a node traversal). For your problem I would recommend 64 elements per node, which corresponds to 6 bits. 16 million records would fit in 6x4 = 4 levels. If you cache the first three levels in memory you eat 64^3 = 262144 x sizeof(record). Assuming a simple file offset for the record, you eat exactly 1MB of memory, 64MB for the file index, and your data can be placed sequentially in the file. :- Ryan -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
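A sketch of the 64-way radix index Matt outlines: a 24-bit key is consumed 6 bits per level, four levels deep, with the top three levels held in memory and the leaf level read from an index file. The sizing follows Matt's description, except that 8-byte off_t entries make the cached array about 2MB rather than the 1MB he quotes for 4-byte offsets; the file name, array layout, and function names are illustrative.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define FANOUT      64
#define TOP_ENTRIES (FANOUT * FANOUT * FANOUT)   /* 262144 cached slots */

/* The top 18 bits of the 24-bit key select a leaf block of 64 slots. */
static unsigned
leaf_block(unsigned key)
{
    return ((key >> 6) & (TOP_ENTRIES - 1));
}

/* Return the data-file offset stored for 'key', or -1 if never inserted. */
static off_t
radix_lookup(int idx_fd, const off_t *top, unsigned key)
{
    off_t block_off = top[leaf_block(key)];      /* cached levels 1-3 */
    off_t slot[FANOUT];

    if (block_off == 0)
        return (-1);                             /* whole leaf block absent */
    if (pread(idx_fd, slot, sizeof(slot), block_off) != sizeof(slot))
        return (-1);
    return (slot[key & (FANOUT - 1)]);           /* level 4: low 6 bits */
}

int
main(void)
{
    off_t *top = calloc(TOP_ENTRIES, sizeof(off_t));   /* ~2MB resident */
    int idx_fd = open("table.idx", O_RDONLY);          /* hypothetical file */

    if (top == NULL || idx_fd < 0)
        return (1);
    /* ... the cached top levels would be loaded from the index file here ... */
    printf("offset: %lld\n", (long long)radix_lookup(idx_fd, top, 0x123456));
    free(top);
    close(idx_fd);
    return (0);
}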
Re: Filesystem holes
Ryan Thompson wrote to Matt Dillon: > Matt Dillon wrote to Ryan Thompson: > > > :> :> storage is rather inefficient for our table of about 2,850,000 members > > :> :> (~2.1 GB total storage). There are 64M possible hash values in our > > :> :> current implementation, and our record size is variable, but could be > > :> :> safely fixed at about 1.5KB... So, total storage if all values were used > > :> :> would be about 96GB. (See where I'm going with this?) > > :... > > : > > :Remember, though, I'm not proposing a hash table. I have been lucky > > :enough to design a "hash" function that will crunch down all possible > > :values into about 64M unique keys--so, no collisions. I propose to use > > :this function with address calculation. So, O(1) everything, for single > > :random lookups, which is (virtually) all this data structure must deal > > :with. Sorting isn't necessary, and that could be accomplished in O(n) > > :anyway. > > > > Are you still talking about creating a 96GB file to manage 2.1GB worth > > of data? I gotta say, that sounds kinda nuts to me! Cpu cycles are > > cheap, I/O and memory is not. > > Hi, Matt! Thanks for the replies. I'll try and keep you interested ;-) > > Hmm... Perhaps you're still missing my original point? I'm talking about > a file with 96GB in addressable bytes (well, probably a bunch of files, > given logical filesize limitations, but let's say for simplicity's sake > that we have a single file). It's actual "size" (in terms of allocated > blocks) will be only a bit larger than 2.1GB. (Directly proportional to > the used size of the dataset. Discrepancies only come into play when > record size != block size, but that can be worked around somewhat) > > In other words, ls -ls will report the "size" as some ridiculously large > number, will show a much smaller block count. So, assuming four records > are added to the file on block boundaries, the file will actually only use > four blocks... nowhere near 96GB! > > In the UNIX filesystem (ya, I know.. just pick one :-), size of file != > space allocated for file. Thus, my original questions were centered > around filesystem holes. I.e., non-allocated chunks in the middle of a > file. When trying to READ from within a hole, the kernel just sends back > a buffer of zeros... which is enough to show that the record is not > initialized. If you prefer to read system documentation instead of me, see lseek(2) :-) The lseek() function allows the file offset to be set beyond the end of the existing end-of-file of the file. If data is later written at this point, subsequent reads of the data in the gap return bytes of zeros (un- til data is actually written into the gap). I suppose gap == hole. Silly semantics. :-) > Actually, something like an "exists" function for a record > wouldn't touch the disk at all! When writing to a hole, the kernel simply > allocates the necessary block(s). This is really fast, too, for creation, > as the empty set can be written to disk with touch(1), and uses far less > memory than virtual initialization or memory structures ;-) > > As an example, try > > fseek(f, 0-1, SEEK_SET); > fputc('\n', f); > > And analyze the reported filesize, as well as the reported block count of > the file. You should see a 2GB "size", and maybe 8K in allocated blocks > (depends how you ran newfs ;-). This is behaviour that has been around > since the original UFS. > > > > Take the BTree example again -- if you think about it, all internal nodes > > in the tree will fit into memory trivially. 
It is only the leaf nodes > > that do not. So in regards to actual disk accesses you still wind up > > only doing *ONE* for the btree, even if it winds up being four or five > > levels deep. > > Right. And, actually, (without looking a bit more closely), I wouldn't be > suprised if you could replace the four-line address calculation I have > with your B+Tree structure and come up with the same result. Only > difference would be a few hundred lines of code, much more testing, and > quite a few megs of RAM... ;-) > > What you referred to as "nuts", above, is just a logical way to provide a > huge address space for a set of data, without actually allocating blocks > in the filesystem for the entire address space until they are used. > > > > The result is that you will be able to create an index to your data, > > which is only 2.8 million records, in around 32 bytes per record
Re: Filesystem holes
Matt Dillon wrote to Ryan Thompson: > :> :> storage is rather inefficient for our table of about 2,850,000 members > :> :> (~2.1 GB total storage). There are 64M possible hash values in our > :> :> current implementation, and our record size is variable, but could be > :> :> safely fixed at about 1.5KB... So, total storage if all values were used > :> :> would be about 96GB. (See where I'm going with this?) > :... > : > :Remember, though, I'm not proposing a hash table. I have been lucky > :enough to design a "hash" function that will crunch down all possible > :values into about 64M unique keys--so, no collisions. I propose to use > :this function with address calculation. So, O(1) everything, for single > :random lookups, which is (virtually) all this data structure must deal > :with. Sorting isn't necessary, and that could be accomplished in O(n) > :anyway. > > Are you still talking about creating a 96GB file to manage 2.1GB worth > of data? I gotta say, that sounds kinda nuts to me! Cpu cycles are > cheap, I/O and memory is not. Hi, Matt! Thanks for the replies. I'll try and keep you interested ;-) Hmm... Perhaps you're still missing my original point? I'm talking about a file with 96GB in addressable bytes (well, probably a bunch of files, given logical filesize limitations, but let's say for simplicity's sake that we have a single file). Its actual "size" (in terms of allocated blocks) will be only a bit larger than 2.1GB. (Directly proportional to the used size of the dataset. Discrepancies only come into play when record size != block size, but that can be worked around somewhat) In other words, ls -ls will report the "size" as some ridiculously large number, but will show a much smaller block count. So, assuming four records are added to the file on block boundaries, the file will actually only use four blocks... nowhere near 96GB! In the UNIX filesystem (ya, I know.. just pick one :-), size of file != space allocated for file. Thus, my original questions were centered around filesystem holes. I.e., non-allocated chunks in the middle of a file. When trying to READ from within a hole, the kernel just sends back a buffer of zeros... which is enough to show that the record is not initialized. Actually, something like an "exists" function for a record wouldn't touch the disk at all! When writing to a hole, the kernel simply allocates the necessary block(s). This is really fast, too, for creation, as the empty set can be written to disk with touch(1), and uses far less memory than virtual initialization or memory structures ;-) As an example, try fseek(f, 0x80000000-1, SEEK_SET); fputc('\n', f); And analyze the reported filesize, as well as the reported block count of the file. You should see a 2GB "size", and maybe 8K in allocated blocks (depends how you ran newfs ;-). This is behaviour that has been around since the original UFS. > Take the BTree example again -- if you think about it, all internal nodes > in the tree will fit into memory trivially. It is only the leaf nodes > that do not. So in regards to actual disk accesses you still wind up > only doing *ONE* for the btree, even if it winds up being four or five > levels deep. Right. And, actually, (without looking a bit more closely), I wouldn't be surprised if you could replace the four-line address calculation I have with your B+Tree structure and come up with the same result. Only difference would be a few hundred lines of code, much more testing, and quite a few megs of RAM...
;-) What you referred to as "nuts", above, is just a logical way to provide a huge address space for a set of data, without actually allocating blocks in the filesystem for the entire address space until they are used. > The result is that you will be able to create an index to your data, > which is only 2.8 million records, in around 32 bytes per record or > 89 Megabytes. Not 96GB. Since you can cache all the internal nodes > of the btree you are done. The machine will do a much better job > caching a 89MB index then a 96GB index. > > Do not make the mistake of equating cpu to I/O. The cpu required to > iterate through 4 levels in the btree (assuming 64 entries per node, > it only takes 4 to cover 16 million records) is *ZERO* ... as in, > probably less then a microsecond, whereas accessing the disk is going > to be milliseconds. CPU time for what I'm proposing is even closer to zero than for a tree... But, you're right, it doesn't make any real difference when compared to disk I/O... B-Trees are good for a lot of things. Address calculation can be really good, too, given a finite key
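For comparison, a minimal sketch of the address-calculation store being proposed: the (collision-free) key times a fixed record size gives the file offset, and records that were never written read back as all zeroes because they fall inside holes. The record size, file name, and the all-zeroes "exists" test are illustrative; note that a record of genuine zeroes would be indistinguishable from an absent one.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define RECSIZE 1536             /* fixed record size discussed in the thread */

struct record {
    char data[RECSIZE];
};

/* Read record 'key'; returns 1 if it looks initialized, 0 if it is a hole. */
static int
rec_read(int fd, unsigned key, struct record *r)
{
    static const struct record zero;
    off_t off = (off_t)key * RECSIZE;

    if (pread(fd, r, sizeof(*r), off) != sizeof(*r))
        memset(r, 0, sizeof(*r));            /* beyond EOF: treat as empty */
    return (memcmp(r, &zero, sizeof(*r)) != 0);
}

/* Write record 'key'; only the blocks it touches get allocated. */
static int
rec_write(int fd, unsigned key, const struct record *r)
{
    off_t off = (off_t)key * RECSIZE;

    return (pwrite(fd, r, sizeof(*r), off) == (ssize_t)sizeof(*r) ? 0 : -1);
}

int
main(void)
{
    int fd = open("store.dat", O_RDWR | O_CREAT, 0644);   /* hypothetical */
    struct record r;

    if (fd < 0)
        return (1);
    memset(&r, 0, sizeof(r));
    snprintf(r.data, sizeof(r.data), "example record");
    rec_write(fd, 1234567, &r);
    printf("exists: %d\n", rec_read(fd, 1234567, &r));
    close(fd);
    return (0);
}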
Re: Filesystem holes
:> :> storage is rather inefficient for our table of about 2,850,000 members :> :> (~2.1 GB total storage). There are 64M possible hash values in our :> :> current implementation, and our record size is variable, but could be :> :> safely fixed at about 1.5KB... So, total storage if all values were used :> :> would be about 96GB. (See where I'm going with this?) :... : :Remember, though, I'm not proposing a hash table. I have been lucky :enough to design a "hash" function that will crunch down all possible :values into about 64M unique keys--so, no collisions. I propose to use :this function with address calculation. So, O(1) everything, for single :random lookups, which is (virtually) all this data structure must deal :with. Sorting isn't necessary, and that could be accomplished in O(n) :anyway. Are you still talking about creating a 96GB file to manage 2.1GB worth of data? I gotta say, that sounds kinda nuts to me! Cpu cycles are cheap, I/O and memory is not. Take the BTree example again -- if you think about it, all internal nodes in the tree will fit into memory trivially. It is only the leaf nodes that do not. So in regards to actual disk accesses you still wind up only doing *ONE* for the btree, even if it winds up being four or five levels deep. The result is that you will be able to create an index to your data, which is only 2.8 million records, in around 32 bytes per record or 89 Megabytes. Not 96GB. Since you can cache all the internal nodes of the btree you are done. The machine will do a much better job caching a 89MB index then a 96GB index. Do not make the mistake of equating cpu to I/O. The cpu required to iterate through 4 levels in the btree (assuming 64 entries per node, it only takes 4 to cover 16 million records) is *ZERO* ... as in, probably less then a microsecond, whereas accessing the disk is going to be milliseconds. :> A B+Tree will also scale with the size of the dataset being managed, :> so you do not have to preallocate or prereserve file space. : :So will address calculation + filesystem holes, and sufficiently large :filesizes :-) Uh, not with a 50:1 file size factor it won't. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Filesystem holes
Matt Dillon wrote to Ryan Thompson and [EMAIL PROTECTED]: > :> Hi all... > :> > :> One the tasks that I have undertaken lately is to improve the efficiency > :> of a couple of storage facilities we use internally, here. Basically, > :> they are moderate-size tables currently implemented in SQL, which is OK in > :> terms of performance, but the hash function is breaking down and the > :> storage is rather inefficient for our table of about 2,850,000 members > :> (~2.1 GB total storage). There are 64M possible hash values in our > :> current implementation, and our record size is variable, but could be > :> safely fixed at about 1.5KB... So, total storage if all values were used > :> would be about 96GB. (See where I'm going with this?) > :> > :> One of the options I am considering is actually using address calculation, > :> and taking advantage of filesystem holes, to keep storage down to what is > :> actually being used, while providing instant lookups. > :> > :> The single file would be about 96G addressable bytes... But the actual > :> block count would be much lower. I suppose I will have to create a series > :> of these files and divide the problem into < 4GB chunks, but one > :> lookup/store will still be done with one calculation and one disk seek > :> (just more filehandles open). > :> > :> Deletes seem problematic. My question is, does the operating system > :> provide any way to free blocks used in the middle of a file? > :> > :> Must I periodically re-create these files (long and slow process, but not > :> so bad if done infrequently) to reclaim space, or is there a way to free > :> arbitrary blocks in a file in userland (WITHOUT trashing the filesystem? > :> :-) > :> > :> - Ryan > :> > :> -- > :> Ryan Thompson <[EMAIL PROTECTED]> > :> Network Administrator, Accounts > :> Phone: +1 (306) 664-1161 > :> > :> SaskNow Technologies http://www.sasknow.com > :> #106-380 3120 8th St E Saskatoon, SK S7H 0W2 > > I would strongly recommend using a B+Tree instead of a hash table. With > a hash table slightly different lookups will seek all over the place. > With a B+Tree lookups will stay more localized. Right... That's a good point, but (and, no, I really didn't mention this in my original post), "sequential" access is never important with the system I have described. Lookups are always random, and they are almost always done one at a time (multiple lookups only done for data swaps... very rare, like 1/1e7 accesses... and those are still O(1) (well, O(2), but who's counting ;-)))... > For example, if you insert a (nearly) sorted dictionary of words into an > SQL table with a B+Tree, the memory working set required to insert > efficiently stays constant whether you are inserting a thousand, a million, > or a billion records. That is, the memory requirement is effecitvely > O(LogN) for a disk requirement of O(N). With a hash table, the memory > working set required to insert efficiently is approximately O(N) for a disk > requirement of O(N)... much much worse. Remember, though, I'm not proposing a hash table. I have been lucky enough to design a "hash" function that will crunch down all possible values into about 64M unique keys--so, no collisions. I propose to use this function with address calculation. So, O(1) everything, for single random lookups, which is (virtually) all this data structure must deal with. Sorting isn't necessary, and that could be accomplished in O(n) anyway. 
> A B+Tree will also scale with the size of the dataset being managed, > so you do not have to preallocate or prereserve file space. So will address calculation + filesystem holes, and sufficiently large filesizes :-)

#include <stdio.h>

int main() {
    FILE *f;

    f = fopen("bigfile", "w");
    fseek(f, 0x7fffffff, SEEK_SET);
    putc('\n', f);
    fclose(f);
    return 0;
}

$ cc -o hole hole.c
$ ./hole
$ ls -lsk bigfile
48 -rw-rw-r--  1 ryan  2147483648 Oct 29 14:09 bigfile

> We are using an in-house SQL database for our product (which I can't > really talk about right now) and, using B+Tree's, I can consistently > insert 300 records/sec on a cheap desktop PC (which I use for testing) > with 64MB of ram (remember, insert requires an internal select to check > for key conflicts), even when the test database grows to several > gigabytes. I haven't even begun implementation, and I haven't had a chance to greatly experiment... So I don't know how this will perform. It would be directly de
Re: Filesystem holes
:> Hi all... :> :> One of the tasks that I have undertaken lately is to improve the efficiency :> of a couple of storage facilities we use internally, here. Basically, :> they are moderate-size tables currently implemented in SQL, which is OK in :> terms of performance, but the hash function is breaking down and the :> storage is rather inefficient for our table of about 2,850,000 members :> (~2.1 GB total storage). There are 64M possible hash values in our :> current implementation, and our record size is variable, but could be :> safely fixed at about 1.5KB... So, total storage if all values were used :> would be about 96GB. (See where I'm going with this?) :> :> One of the options I am considering is actually using address calculation, :> and taking advantage of filesystem holes, to keep storage down to what is :> actually being used, while providing instant lookups. :> :> The single file would be about 96G addressable bytes... But the actual :> block count would be much lower. I suppose I will have to create a series :> of these files and divide the problem into < 4GB chunks, but one :> lookup/store will still be done with one calculation and one disk seek :> (just more filehandles open). :> :> Deletes seem problematic. My question is, does the operating system :> provide any way to free blocks used in the middle of a file? :> :> Must I periodically re-create these files (long and slow process, but not :> so bad if done infrequently) to reclaim space, or is there a way to free :> arbitrary blocks in a file in userland (WITHOUT trashing the filesystem? :> :-) :> :> - Ryan :> :> -- :> Ryan Thompson <[EMAIL PROTECTED]> :> Network Administrator, Accounts :> Phone: +1 (306) 664-1161 :> :> SaskNow Technologies http://www.sasknow.com :> #106-380 3120 8th St E Saskatoon, SK S7H 0W2 I would strongly recommend using a B+Tree instead of a hash table. With a hash table slightly different lookups will seek all over the place. With a B+Tree lookups will stay more localized. For example, if you insert a (nearly) sorted dictionary of words into an SQL table with a B+Tree, the memory working set required to insert efficiently stays constant whether you are inserting a thousand, a million, or a billion records. That is, the memory requirement is effectively O(LogN) for a disk requirement of O(N). With a hash table, the memory working set required to insert efficiently is approximately O(N) for a disk requirement of O(N)... much, much worse. A B+Tree will also scale with the size of the dataset being managed, so you do not have to preallocate or prereserve file space. We are using an in-house SQL database for our product (which I can't really talk about right now) and, using B+Tree's, I can consistently insert 300 records/sec on a cheap desktop PC (which I use for testing) with 64MB of ram (remember, insert requires an internal select to check for key conflicts), even when the test database grows to several gigabytes. In any case, a B+Tree is the way to go. Hash tables are useful only in very, very, very specialized circumstances. In all other circumstances they are no better than a pain in the rear. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Filesystem holes
Luigi Rizzo wrote: > > Hi, > > how about using an indirect table of 64M 32-bit pointers into > the actual blocks being used ? For insertions you just > allocate a new fixed size block from the file. Isn't this roughly what dbm does with the hash key method? -- "Where am I, and what am I doing in this handbasket?" Wes Peters Softweyr LLC [EMAIL PROTECTED] http://softweyr.com/ To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: Filesystem holes
Hi, how about using an indirect table of 64M 32-bit pointers into the actual blocks being used ? For insertions you just allocate a new fixed size block from the file. For deletion, either keep a list of freed blocks, or provide a back pointer from each entry into the hash table so when you remove a block (creating a hole) you can move the last allocated one into the hole and update the hash table. Kind of quick. You might have to pay an extra disk access on each access but given enough memory you'd save that one as well. cheers luigi --+- Luigi RIZZO, [EMAIL PROTECTED] . ACIRI/ICSI (on leave from Univ. di Pisa) http://www.iet.unipi.it/~luigi/ . 1947 Center St, Berkeley CA 94704 Phone: (510) 666 2927 --+- On Sat, 28 Oct 2000, Ryan Thompson wrote: > > Hi all... > > One the tasks that I have undertaken lately is to improve the efficiency > of a couple of storage facilities we use internally, here. Basically, > they are moderate-size tables currently implemented in SQL, which is OK in > terms of performance, but the hash function is breaking down and the > storage is rather inefficient for our table of about 2,850,000 members > (~2.1 GB total storage). There are 64M possible hash values in our > current implementation, and our record size is variable, but could be > safely fixed at about 1.5KB... So, total storage if all values were used > would be about 96GB. (See where I'm going with this?) > > One of the options I am considering is actually using address calculation, > and taking advantage of filesystem holes, to keep storage down to what is > actually being used, while providing instant lookups. > > The single file would be about 96G addressable bytes... But the actual > block count would be much lower. I suppose I will have to create a series > of these files and divide the problem into < 4GB chunks, but one > lookup/store will still be done with one calculation and one disk seek > (just more filehandles open). > > Deletes seem problematic. My question is, does the operating system > provide any way to free blocks used in the middle of a file? > > Must I periodically re-create these files (long and slow process, but not > so bad if done infrequently) to reclaim space, or is there a way to free > arbitrary blocks in a file in userland (WITHOUT trashing the filesystem? > :-) > > - Ryan > > -- > Ryan Thompson <[EMAIL PROTECTED]> > Network Administrator, Accounts > Phone: +1 (306) 664-1161 > > SaskNow Technologies http://www.sasknow.com > #106-380 3120 8th St E Saskatoon, SK S7H 0W2 > To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
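A function-level sketch of the deletion trick described above: when a data block is freed, the last block in the data file is moved into the hole and its indirection-table entry is fixed up via a back pointer stored with the block, so the data file never accumulates unused space. The structures, names, and block size are illustrative; clearing the deleted record's own table entry is left to the caller.

#include <sys/types.h>
#include <unistd.h>

#define BLKSIZE 1536

struct blk {
    unsigned hash_slot;          /* back pointer into the indirection table */
    char     data[BLKSIZE];
};

/*
 * table[] maps hash slots to block numbers in the data file; *nblocks is
 * the current number of allocated blocks.  Delete block number 'victim'.
 */
int
delete_block(int data_fd, unsigned *table, unsigned *nblocks, unsigned victim)
{
    struct blk last;
    unsigned lastno = --(*nblocks);

    if (victim != lastno) {
        /* Move the last block into the hole left by the victim. */
        if (pread(data_fd, &last, sizeof(last),
                  (off_t)lastno * sizeof(last)) != (ssize_t)sizeof(last))
            return (-1);
        if (pwrite(data_fd, &last, sizeof(last),
                   (off_t)victim * sizeof(last)) != (ssize_t)sizeof(last))
            return (-1);
        table[last.hash_slot] = victim;      /* fix up the back reference */
    }
    /* Shrink the data file by one block. */
    return (ftruncate(data_fd, (off_t)*nblocks * sizeof(last)));
}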
Filesystem holes
Hi all... One of the tasks that I have undertaken lately is to improve the efficiency of a couple of storage facilities we use internally, here. Basically, they are moderate-size tables currently implemented in SQL, which is OK in terms of performance, but the hash function is breaking down and the storage is rather inefficient for our table of about 2,850,000 members (~2.1 GB total storage). There are 64M possible hash values in our current implementation, and our record size is variable, but could be safely fixed at about 1.5KB... So, total storage if all values were used would be about 96GB. (See where I'm going with this?) One of the options I am considering is actually using address calculation, and taking advantage of filesystem holes, to keep storage down to what is actually being used, while providing instant lookups. The single file would be about 96G addressable bytes... But the actual block count would be much lower. I suppose I will have to create a series of these files and divide the problem into < 4GB chunks, but one lookup/store will still be done with one calculation and one disk seek (just more filehandles open). Deletes seem problematic. My question is, does the operating system provide any way to free blocks used in the middle of a file? Must I periodically re-create these files (long and slow process, but not so bad if done infrequently) to reclaim space, or is there a way to free arbitrary blocks in a file in userland (WITHOUT trashing the filesystem? :-) - Ryan -- Ryan Thompson <[EMAIL PROTECTED]> Network Administrator, Accounts Phone: +1 (306) 664-1161 SaskNow Technologies http://www.sasknow.com #106-380 3120 8th St E Saskatoon, SK S7H 0W2 To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
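As a footnote to the "block count would be much lower" point above, st_size reports the logical size of a file while st_blocks reports what is actually allocated, so the gap between them is the holes. A small reporting sketch (it measures; it does not free anything):

#include <sys/stat.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    struct stat sb;

    if (argc != 2 || stat(argv[1], &sb) == -1) {
        perror("stat");
        return (1);
    }
    printf("logical size: %lld bytes\n", (long long)sb.st_size);
    printf("allocated:    %lld bytes (%lld 512-byte blocks)\n",
           (long long)sb.st_blocks * 512, (long long)sb.st_blocks);
    return (0);
}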