Re: fs/udf: vm pages "overlap" while reading large dir [patch]
Andriy Gapon wrote: on 26/02/2008 21:23 Pav Lucistnik said the following: Pav Lucistnik wrote on Tue 05 Feb 2008 at 19:16 +0100: Andriy Gapon wrote on Tue 05 Feb 2008 at 16:40 +0200: Yay, and can you fix the sequential read performance while you're at it? Kthx! this was almost trivial :-) See the attached patch; the first hunk is just for consistency. The code was borrowed from cd9660, only field/variable names are adjusted. Just tested it with my shiny new Blu-ray drive, and it works wonders. Finally seamless playback of media files off UDF-carrying media. So, how does it look WRT committing it? Pav, thank you for the feedback/reminder. In my personal opinion the latest patch posted and described at the following link is a very good candidate for commit: http://docs.FreeBSD.org/cgi/mid.cgi?47AA43B9.1040608 It might have a couple of style-related issues, but it should improve things a bit even if some larger VM/VFS/GEOM issues remain. And by the way, a patch from the following PR would be a good "side-dish" for the above patch: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/78987 I think it is simple and obvious enough. I will commit both of these to CVS today. Thanks again. Scott ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
Pav Lucistnik wrote: Andriy Gapon wrote on Thu 28 Feb 2008 at 10:33 +0200: And while I have your attention, I have a related question. I have produced a bunch of ISO9660 Level 3 / UDF hybrid media with mkisofs, and when I mount the UDF part of them, the mount point (root directory of the media) has 0x000 permissions. Yes, that's right, d- in ls -l. That makes the whole volume inaccessible for everyone except root. Is this something you can mend in our UDF driver, or should I go dig inside mkisofs guts? Windows handles these media without any visible problems. I wonder if Windows even observes the permission bits. You'd have to special-case the UDF driver code in FreeBSD, which is certainly possible but not terribly attractive. I'd be interested to see what exactly mkisofs is doing. Maybe it's putting permissions into extended attributes and assuming the filesystem driver will read those instead. Scott
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
on 28/02/2008 11:59 Pav Lucistnik said the following: > Andriy Gapon wrote on Thu 28 Feb 2008 at 10:33 +0200: > > And while I have your attention, I have a related question. > > I have produced a bunch of ISO9660 Level 3 / UDF hybrid media with > mkisofs, and when I mount the UDF part of them, the mount point (root > directory of the media) has 0x000 permissions. Yes, that's right, d- > in ls -l. That makes the whole volume inaccessible for everyone except > root. > > Is this something you can mend in our UDF driver, or should I go dig > inside mkisofs guts? Windows handles these media without any visible > problems. > I think this ought to be something in mkisofs; last I checked our code was fully conforming to the specs in this regard. And obviously it works with windoz, because it simply ignores the permissions :-) We might consider adding a fixup for such broken media (if my assessment is correct), but I'd rather prefer that it is fixed at the source. -- Andriy Gapon
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
Andriy Gapon wrote on Thu 28 Feb 2008 at 10:33 +0200: And while I have your attention, I have a related question. I have produced a bunch of ISO9660 Level 3 / UDF hybrid media with mkisofs, and when I mount the UDF part of them, the mount point (root directory of the media) has 0x000 permissions. Yes, that's right, d- in ls -l. That makes the whole volume inaccessible for everyone except root. Is this something you can mend in our UDF driver, or should I go dig inside mkisofs guts? Windows handles these media without any visible problems. -- Pav Lucistnik <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> How will you recognize an experienced hacker from a beginner? The beginner thinks that a kilobyte has 1000 bytes. The experienced hacker thinks one kilometer has 1024 meters.
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
on 26/02/2008 21:23 Pav Lucistnik said the following: > Pav Lucistnik wrote on Tue 05 Feb 2008 at 19:16 +0100: >> Andriy Gapon wrote on Tue 05 Feb 2008 at 16:40 +0200: >> Yay, and can you fix the sequential read performance while you're at it? Kthx! >>> this was almost trivial :-) >>> See the attached patch, the first hunk is just for consistency. >>> The code was borrowed from cd9660, only field/variable names are adjusted. > > Just tested it with my shiny new Blu-ray drive, and it works wonders. > Finally seamless playback of media files off UDF-carrying media. > > So, how does it look WRT committing it? > Pav, thank you for the feedback/reminder. In my personal opinion the latest patch posted and described at the following link is a very good candidate for commit: http://docs.FreeBSD.org/cgi/mid.cgi?47AA43B9.1040608 It might have a couple of style-related issues, but it should improve things a bit even if some larger VM/VFS/GEOM issues remain. And by the way, a patch from the following PR would be a good "side-dish" for the above patch: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/78987 I think it is simple and obvious enough. -- Andriy Gapon
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
Pav Lucistnik wrote on Tue 05 Feb 2008 at 19:16 +0100: > Andriy Gapon wrote on Tue 05 Feb 2008 at 16:40 +0200: > > > > Yay, and can you fix the sequential read performance while you're at it? > > > Kthx! > > > this was almost trivial :-) > > See the attached patch, the first hunk is just for consistency. > > The code was borrowed from cd9660, only field/variable names are adjusted. Just tested it with my shiny new Blu-ray drive, and it works wonders. Finally seamless playback of media files off UDF-carrying media. So, how does it look WRT committing it? -- Pav Lucistnik <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> Somebody ought to cross ball point pens with coat hangers so that the pens will multiply instead of disappear.
Re: fs/udf: vm pages "overlap" while reading large dir
On Tue, 12 Feb 2008, Andriy Gapon wrote: on 12/02/2008 15:11 Bruce Evans said the following: On Tue, 12 Feb 2008, Poul-Henning Kamp wrote: In message <[EMAIL PROTECTED]>, Andriy Gapon writes: 3.1. for a fresh buf getblk would assign the following: bsize = bo->bo_bsize; offset = blkno * bsize; That's clearly wrong. If units were always 512-blocks, then anything except bsize = DEV_BSIZE would be clearly wrong. Things aren't that simple (but probably should be). Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks. O, I missed this obvious thing! Me too. Bruce, thank you for putting me back on the ground :-) Even in the UDF case, when we bread() via a file or directory vnode we pass a logical 2048-byte block number (within the file) to bread(). In this case the code in getblk() does the right thing when it calculates offset as blkno * 2048. Calculating it assuming 512-byte units would be a disaster. I think the size is mnt_stat.f_iosize in general (but not for device vnodes). And the actual reading works correctly because udf_strategy is called for such vnodes and it translates block numbers from physical to logical and also correctly re-calculates b_iooffset for actual reading. So the b_iooffset value set in breadn() (using dbtob) is completely ignored. Is this setting (and the corresponding setting for writing) ever used? I remember that this is why g_vfs_* was my primary target. It seems that I revert my opinion back: either g_vfs_open should be smart about setting bo_bsize, or g_vfs_strategy should be smart about calculating bio_offset, or individual filesystems should not depend on g_vfs_* always doing the best thing for them. In fact, ffs already overrides the setting of bo_bsize for the device vnode to a different wrong setting -- g_vfs_open() sets the sector size, and ffs_mount() changes the setting to fs_bsize, but ffs actually needs the setting to be DEV_BSIZE (I think).
Other bugs from this: - ffs_rawread wants the sector size, and it assumes that this is in bo_bsize for the device vnode, but ffs_mount() has changed this to fs_bsize. - multiple r/o mounts are supposed to work, but don't, since there is only one device vnode with a shared bufobj, but the bufobj needs to be per-file-system since all mounts write to it. Various bad things including panics result. There is a well-known panic from bo_private becoming garbage on unmount. I just noticed more possibilities for panics. bo_ops points to static storage, so it never becomes complete garbage. However, at least ffs sets it blindly early on in ffs_mountfs(), before looking at the file system to see if ffs can mount it. Thus if the file system is already mounted by another ffs, then ffs clobbers the other fs's bo_ops. The ffs mount will presumably fail, leaving bo_ops clobbered. Also, a successful g_vfs_open() has clobbered bo_ops, bo_private and bo_bsize a little earlier. g_vfs_open() is fundamental to looking at the file system, since I/O is not set up until it completes. Clobbering the pointers is most dangerous, but just clobbering bo_bsize breaks blkno calculations for any code that uses bo_bsize. Apart from these bugs, the fix for the blkno calculations for device vp's may be as simple as setting bo_bsize to DEV_BSIZE for the device vp of all disk file systems (since all disk file systems expect this size). Currently, ffs seems to be the only file system that overrides g_vfs_open()'s default of the sector size. Nothing in any of fs/, gnu/fs/ and contrib/ references bo_bsize. In any case, it seems that it is the g_vfs_* that is currently inconsistent: it sets bo_bsize to a somewhat arbitrary value, but expects that it would always be provided with a correct b_iooffset (the latter being rigidly calculated via dbtob() in buffcache code).
So this leaves filesystems dead in the water while reading via a device vnode: provide blkno in 512-byte units - screw up the VM cache, provide blkno in bo_bsize units - screw up reading from disk. I think I/O on the disk doesn't use bo_bsize, but only the sector size (to scale the byte count to a sector count). Otherwise, ffs's override would break I/O. geom is another place that has few references to bo_bsize -- it just clobbers it in g_vfs_open(). Not sure if the FS-s should have wits to set bo_bsize properly and/or provide proper bo_ops - we are talking about a device vnode here. Should filesystems be involved in the business of setting its properties? Not sure. Yes. They can set it more easily as they can tell g_vfs_open() what to set it to. Except, until the bugs are fixed properly, g_vfs_open() can more easily do tests to prevent clobbering. For bo_bsize and bo_ops, sharing a common value is safe, and unsafe changes can be detected by setting to a special value on last unmount. For bo_private, a safety check would probably disallow multiple mounts (since cp is dynamically allocated (?)). Bruce
Re: fs/udf: vm pages "overlap" while reading large dir
on 12/02/2008 17:58 Bruce Evans said the following: > On Tue, 12 Feb 2008, Andriy Gapon wrote: >> And the actual reading works correctly because udf_strategy is called >> for such vnodes and it translates block numbers from physical to logical >> and also correctly re-calculates b_iooffset for actual reading. So >> the b_iooffset value set in breadn() (using dbtob) is completely ignored. > > Is this setting (and the corresponding setting for writing) > ever used? Bruce, replying only to this part (digesting the others): yes, it is used by g_vfs_strategy, which is the bufobj strategy after g_vfs_open. It passes b_iooffset as is to bio_offset. P.S. of course, I am speaking about the current sources - I know you have almost an OS of your own, kidding :-) -- Andriy Gapon
Re: fs/udf: vm pages "overlap" while reading large dir
on 12/02/2008 15:11 Bruce Evans said the following: > On Tue, 12 Feb 2008, Poul-Henning Kamp wrote: > >> In message <[EMAIL PROTECTED]>, Andriy Gapon writes: >> >>> 2.3. this code passes to bread blkno that is calculated as 4*sector, >>> where sector is a number of a physical 2048-byte sector. [**] >>> [**] - I think that this is a requirement of the buffcache system, because >>> internally it performs many calculations that seem to assume that block >>> size is always 512. >> Yes, the buf-cache works in 512-byte units throughout. > > I missed the dbtob() conversions in vfs_bio.c when I replied previously. > So blkno cannot be a cookie; it must be for a 512-block. So how did > the cases where bsize != DEV_BSIZE in getblk() ever work? It seems that this is because a specific VOP_STRATEGY and/or BO_STRATEGY kicked in and did the right thing, see below. >>> 3.1. for a fresh buf getblk would assign the following: >>> bsize = bo->bo_bsize; >>> offset = blkno * bsize; >> That's clearly wrong. > > If units were always 512-blocks, then anything except bsize = DEV_BSIZE > would be clearly wrong. Things aren't that simple (but probably should > be). Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks. O, I missed this obvious thing! Bruce, thank you for putting me back on the ground :-) Even in the UDF case, when we bread() via a file or directory vnode we pass a logical 2048-byte block number (within the file) to bread(). In this case the code in getblk() does the right thing when it calculates offset as blkno * 2048. Calculating it assuming 512-byte units would be a disaster. And the actual reading works correctly because udf_strategy is called for such vnodes and it translates block numbers from physical to logical and also correctly re-calculates b_iooffset for actual reading. So the b_iooffset value set in breadn() (using dbtob) is completely ignored. I remember that this is why g_vfs_* was my primary target.
It seems that I revert my opinion back: either g_vfs_open should be smart about setting bo_bsize, or g_vfs_strategy should be smart about calculating bio_offset, or individual filesystems should not depend on g_vfs_* always doing the best thing for them. In any case, it seems that it is the g_vfs_* that is currently inconsistent: it sets bo_bsize to a somewhat arbitrary value, but expects that it would always be provided with a correct b_iooffset (the latter being rigidly calculated via dbtob() in buffcache code). So this leaves filesystems dead in the water while reading via a device vnode: provide blkno in 512-byte units - screw up the VM cache, provide blkno in bo_bsize units - screw up reading from disk. Not sure if the FS-s should have wits to set bo_bsize properly and/or provide proper bo_ops - we are talking about a device vnode here. Should filesystems be involved in the business of setting its properties? Not sure. But it seems that there are many possibilities for "fixups" and various filesystems [have to] do stuff in their unique ways (*_STRATEGY, etc). -- Andriy Gapon
Re: fs/udf: vm pages "overlap" while reading large dir
On Tue, 12 Feb 2008, Poul-Henning Kamp wrote: In message <[EMAIL PROTECTED]>, Andriy Gapon writes: 2.3. this code passes to bread blkno that is calculated as 4*sector, where sector is a number of a physical 2048-byte sector. [**] [**] - I think that this is a requirement of the buffcache system, because internally it performs many calculations that seem to assume that block size is always 512. Yes, the buf-cache works in 512-byte units throughout. I missed the dbtob() conversions in vfs_bio.c when I replied previously. So blkno cannot be a cookie; it must be for a 512-block. So how did the cases where bsize != DEV_BSIZE in getblk() ever work? 3.1. for a fresh buf getblk would assign the following: bsize = bo->bo_bsize; offset = blkno * bsize; That's clearly wrong. If units were always 512-blocks, then anything except bsize = DEV_BSIZE would be clearly wrong. Things aren't that simple (but probably should be). Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks. That seems to include nfs(client). In fact, nfs_getcacheblk() does weird scaling which seems to be mainly to compensate for weird scaling here. It calls getblk() with a bn arg that seems to be in f_iosize units. Then at the end, for the VREG case, it sets bp->b_blkno to this bn scaled to normal DEV_BSIZE units. bp->b_blkno seems to have DEV_BSIZE units for all uses of it in nfs. We need to assert that the blkno is aligned to the start of a sector and use the 512 byte units, so I guess it would be: offset = dbtob(blkno); KASSERT(!(offset & (bsize - 1)), ("suitable diagnostic")); Barely worth checking. The current bug has nothing to do with this. The offset is just wrong at this point, after using a scale factor that is inconsistent with the units of blkno, for all (?) disk (and other?) file systems whose sector size isn't 512. Bruce
Re: fs/udf: vm pages "overlap" while reading large dir
On Tue, 12 Feb 2008, Poul-Henning Kamp wrote: In message <[EMAIL PROTECTED]>, Andriy Gapon writes: on 06/02/2008 18:29 Andriy Gapon said the following: Small summary of the above long description. For directory reading fs/udf performs bread() on a (underlying) device vnode. It passes the block number as if the block size was 512 bytes (i.e. byte_offset_within_dev/512). We have three sizes of relevance here, the sectorsize of the provider, the blocksize of the filesystem and the page size of the system. 4. The block size required for bread() and friends (almost always DEV_BSIZE). In general it is adventurous to have any of them be anything but powers of two, and it is at best ill-advised and more likely asking for trouble to do requests that are not a multiple of and aligned to the sectorsize of the provider. So up front, I would say that it is a UDF bug to do 512 byte reads off a 2k sector provider. Making the buffer cache handle this is possible, but it is not the direction we have planned for the buffercache (ie: that it should become a wrapper for using the VM system, rather than being a separately allocated cache). So if the objective is to make UDF work in the short term, your change might work; if the objective is to move FreeBSD's kernel architecture forward, the right thing to do is to fix UDF to not access subsectors. This bug has nothing to do with subsectors, and very little to do with udf. udf is just depending on bread()'s API being unbroken. That API requires starting with blocks consisting of whole sectors (else the subsequent I/O would fail) and converting to blocks of size DEV_BSIZE, except for unusual (non-disk?) file systems like nfs (and no others?). All (?) disk file systems do this conversion. The bug is just more noticeable for udf since it is used more often on disks with a block size != DEV_BSIZE, and it apparently does something that makes the bug mess up VM more than other file systems.
Of course, it might be better for the API to take blocks in units of the sector size, but that is not what it takes, and this can't be changed easily or safely. The standard macro btodb() is often used to convert from bytes to blocks of this size, and doesn't support bo_bsize or the bufobj pointer that would be needed to reach bo_bsize. ffs mostly uses its fsbtodb() macro, which converts blocks in ffs block (frag?)-sized units to blocks in DEV_BSIZE units so as to pass them to unbroken bread() and friends. ffs initializes bo_bsize to its block (not frag) size, and then never uses it directly except in ffs_rawread(), where it is used to check that the I/O is in units of whole sectors (otherwise, ffs_rawread() has DEV_BSIZE more hard-coded than the rest of ffs). The details of fsbtodb() are interesting. They show signs of previous attempts to use units of sectors. From ffs/fs.h: % /* % * Turn filesystem block numbers into disk block addresses. % * This maps filesystem blocks to device size blocks. % */ % #define fsbtodb(fs, b) ((daddr_t)(b) << (fs)->fs_fsbtodb) % #define dbtofsb(fs, b) ((b) >> (fs)->fs_fsbtodb) Creation of fs_fsbtodb is left to newfs. The units of DEV_BSIZE are thus built in to the on-disk data struct (in a fairly easy to change and/or override way, but there are a lot of existing disks). From newfs.c: % realsectorsize = sectorsize; % if (sectorsize != DEV_BSIZE) { /* XXX */ % int secperblk = sectorsize / DEV_BSIZE; % % sectorsize = DEV_BSIZE; % fssize *= secperblk; % if (pp != NULL) % pp->p_size *= secperblk; % } % mkfs(pp, special); Though mkfs() appears to support disk blocks of size = sectorsize, its sectorsize parameter is hacked on here, so it always generates fs_fsbtodb and other derived parameters for disk blocks of size DEV_BSIZE, as is required for the unbroken bread() API. 
msdosfs is similar to ffs here, except it constructs the shift count at mount time, to arrange for always converting to DEV_BSIZE blocks for passing to bread() and friends. udf is effectively similar, but pessimal and disorganized. For the conversion for bread(), it uses btodb(). In udf_bmap(), it constructs a shift count on every call by subtracting DEV_BSHIFT from its block shift count. In udf_strategy(), on every call it constructs a multiplier instead of a shift count, by dividing its block size by DEV_BSIZE. It's weird that the strategy routine, which will soon end up sending sectors to the disk, is converting to DEV_BSIZE units, a size that cannot be the size for udf's normal media. cd9660 uses btodb() for one conversion for bread() and ( - DEV_BSHIFT) in 7 other conversions to DEV_BSIZE units. So it's surprising that any file systems work on CDs and DVDs. Maybe the bug affects udf more because udf crosses page boundaries more. It's also surprising that the bug has such small effects. This seems to be because the blkno arg to bread() and fr
Re: fs/udf: vm pages "overlap" while reading large dir
on 12/02/2008 13:47 Poul-Henning Kamp said the following: > In message <[EMAIL PROTECTED]>, Andriy Gapon writes: > >> 2.3. this code passes to bread blkno that is calculated as 4*sector, >> where sector is a number of a physical 2048-byte sector. [**] >> [**] - I think that this is a requirement of the buffcache system, because >> internally it performs many calculations that seem to assume that block >> size is always 512. > > Yes, the buf-cache works in 512-byte units throughout. > >> 3.1. for a fresh buf getblk would assign the following: >> bsize = bo->bo_bsize; >> offset = blkno * bsize; > > That's clearly wrong. > > We need to assert that the blkno is aligned to the start of a sector > and use the 512 byte units, so I guess it would be: > > offset = dbtob(blkno); > KASSERT(!(offset & (bsize - 1)), ("suitable diagnostic")); > > Thank you for this very insightful and neat suggestion! I think that it must work, but I will try to test it tonight on whatever media and FS-s I have available. Thank you again! P.S. hope to not get '"suitable diagnostic"' from something like msdosfs :-) -- Andriy Gapon
Re: fs/udf: vm pages "overlap" while reading large dir
In message <[EMAIL PROTECTED]>, Andriy Gapon writes: >2.3. this code passes to bread blkno that is calculated as 4*sector, >where sector is a number of a physical 2048-byte sector. [**] >[**] - I think that this is a requirement of the buffcache system, because >internally it performs many calculations that seem to assume that block >size is always 512. Yes, the buf-cache works in 512-byte units throughout. >3.1. for a fresh buf getblk would assign the following: >bsize = bo->bo_bsize; >offset = blkno * bsize; That's clearly wrong. We need to assert that the blkno is aligned to the start of a sector and use the 512 byte units, so I guess it would be: offset = dbtob(blkno); KASSERT(!(offset & (bsize - 1)), ("suitable diagnostic")); -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence.
Re: fs/udf: vm pages "overlap" while reading large dir
on 12/02/2008 10:53 Poul-Henning Kamp said the following: > In message <[EMAIL PROTECTED]>, Andriy Gapon writes: >> on 06/02/2008 18:29 Andriy Gapon said the following: >>> Small summary of the above long description. >>> For directory reading fs/udf performs bread() on a (underlying) device >>> vnode. It passes block number as if block size was 512 bytes (i.e. >>> byte_offset_within_dev/512). > > We have three sizes of relevance here, the sectorsize of the provider, > the blocksize of the filesystem and the page size of the system. > > In general it is adventurous to have any of them be anything but > powers of two, and it is at best ill-advised and more likely asking > for trouble to do requests that are not a multiple of and aligned to > the sectorsize of the provider. > > So up front, I would say that it is a UDF bug to do 512 byte reads off > a 2k sector provider. > > Making the buffer cache handle this is possible, but it is not the > direction we have planned for the buffercache (ie: that it should > become a wrapper for using the VM system, rather than being a > separately allocated cache). > > So if the objective is to make UDF work in the short term, your > change might work; if the objective is to move FreeBSD's kernel > architecture forward, the right thing to do is to fix UDF to not > access subsectors. > Poul, I agree with what you say, but I think that I didn't properly explain what the UDF code is doing and why it might be important in general. Let me try to do it step-by-step (sorry if I'll say something too obvious). And please correct me if I misunderstand something in the fundamental steps. 1.1. UDF is typically used with CD/DVD media, so the provider's sector size is 2048. 1.2. udf vfs mount method calls g_vfs_open. 1.3. g_vfs_open creates a vobject for a device vnode. 1.4. g_vfs_open sets bo_bsize=pp->sectorsize in the device vnode's bufobj. 1.5. g_vfs_open also overrides bo_ops for the bufobj. 2.1.
UDF directory reading code performs bread() via the device vnode. [*] 2.2. this code passes to bread a size that is a multiple of 2048. 2.3. this code passes to bread blkno that is calculated as 4*sector, where sector is a number of a physical 2048-byte sector. [**] [*] - this seems to be an uncommon thing among filesystems. [**] - I think that this is a requirement of the buffcache system, because internally it performs many calculations that seem to assume that block size is always 512. E.g. breadn() code has the following: bp->b_iooffset = dbtob(bp->b_blkno); <-- effectively multiplies by 512 bstrategy(bp); And g_vfs_strategy has the following: bip->bio_offset = bp->b_iooffset; So, if the udf code would pass blkno==sector, then bio_offset would be incorrect. Or maybe g_vfs_strategy should do some translation here from b_iooffset to bio_offset taking into account bo_bsize ?? So that the actual, non-adjusted, sector number could be passed to bread() ? 3.1. for a fresh buf getblk would assign the following: bsize = bo->bo_bsize; offset = blkno * bsize; ... bp->b_blkno = bp->b_lblkno = blkno; bp->b_offset = offset; <--- this is where this discussion started so b_offset of a buffer is 4*sector*2048. This is a source of the trouble. 3.2. getblk would set bp->b_flags |= B_VMIO; if a vnode has a vobject, and our device vnode has it (step 1.3). 3.3. consequently allocbuf will execute code for the B_VMIO case. 3.4. allocbuf will look up/allocate pages by index which is calculated from the "base index" of OFF_TO_IDX(bp->b_offset), which is, in essence, bp->b_offset divided by the page size, 4096. So our "base index" is 2*sector. 4.1. Let's assume we bread() (via the device vnode, as described above) from physical sector N and we read 6 physical sectors (6*2048 bytes).
4.2 Sectors will get "mapped"/tied/bound to VM pages as follows (according to the above calculations): sectors[N, N+1] -> page[2*N], sectors[N+2, N+3] -> page[2*N + 1], /* the next page */ sectors[N+4, N+5] -> page[2*N + 2] /* the next page */ 4.5 Now let's consider "the next read" of X sectors, but now starting from sector N+1; repeating the calculations we get the following mapping: sectors[(N+1), (N+1) + 1] -> page[2*(N+1)] = page[2N +2] But this page already has cached data from sectors[N+4, N+5]. Problem!! Theoretical calculations show it and practical testing confirms it. So this is a proof that bread()-ing via a device vnode is broken if: C1) the vnode was "set up" by g_vfs_open(); C2) the sector size of the underlying geom provider is not 512, but any non-trivial multiple of it; C3) the bread size is sufficiently big; Current UDF code for directory reading is the only place known to me that meets all the 3 above conditions (for sufficiently large directories to meet condition C3). So I stated this, now let the wise speak. I already have a patch that makes UDF read directories via the directory vnodes. But the problem in general would remain. Maybe g_vfs_open is a correct place to change, maybe g_vfs_strategy is the place, maybe something, maybe "don
Re: fs/udf: vm pages "overlap" while reading large dir
In message <[EMAIL PROTECTED]>, Andriy Gapon writes: >on 06/02/2008 18:29 Andriy Gapon said the following: >> Small summary of the above long description. >> For directory reading fs/udf performs bread() on a (underlying) device >> vnode. It passes block number as if block size was 512 bytes (i.e. >> byte_offset_within_dev/512). We have three sizes of relevance here, the sectorsize of the provider, the blocksize of the filesystem and the page size of the system. In general it is adventurous to have any of them be anything but powers of two, and it is at best ill-advised and more likely asking for trouble to do requests that are not a multiple of and aligned to the sectorsize of the provider. So up front, I would say that it is a UDF bug to do 512 byte reads off a 2k sector provider. Making the buffer cache handle this is possible, but it is not the direction we have planned for the buffercache (ie: that it should become a wrapper for using the VM system, rather than being a separately allocated cache). So if the objective is to make UDF work in the short term, your change might work; if the objective is to move FreeBSD's kernel architecture forward, the right thing to do is to fix UDF to not access subsectors. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence.
Re: fs/udf: vm pages "overlap" while reading large dir
on 06/02/2008 18:29 Andriy Gapon said the following: > Small summary of the above long description. > For directory reading fs/udf performs bread() on a (underlying) device > vnode. It passes block number as if block size was 512 bytes (i.e. > byte_offset_within_dev/512). On the other hand vm page index calculation > code uses the following formula: (blkno*bo_bsize)/PAGE_SIZE. > For CD/DVD devices bo_bsize is set to 2048. This results in page index > overlap when reading sufficiently many blocks from adjacent starting blocks. > That is, it seems that in "normal" cases page index gets calculated as > byte_offset/PAGE_SIZE, but in this case page index gets to be > 4*byte_offset/PAGE_SIZE. More details are in the quoted above text. > > With a lot of help from Bruce Evans I found this commit which seems to > have made a big difference: > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.458;r2=1.459;f=h > > Before this change page index for device vnodes was always > (blkno*512)/PAGE_SIZE. This is because of the vn_isdisk() case. > > udf_mountfs device vnode is passed to g_vfs_open() (in geom_vfs.c) and > the latter has the following line of code: > bo->bo_bsize = pp->sectorsize; > Where pp is geom provider for the device in question. > > I am not sure if the mentioned above vfs_bio.c change broke something > else in reading from device vnodes or if it fixed something for that. I > am also not sure what would be the consequences of setting bo_bsize to > 512 for vnodes of "disk" devices. It would most probably fix current > code in UDF, but I am not sure if it might break anything else. > > Just wanted to let the more knowledgeable people know and make a decision. > > P.S. bo_bsize seems to be referenced only in a handful of files and in > most of them it seems that it is related to "file" vnodes (as opposed to > "device" vnodes). > Poul-Henning, I am sorry for this duplicate post and very sorry for misspelling your name in the first one. 
do you have any comment or opinion on the above? [sorry if clamped too much of the previous context, but it's all in archives] And a little continuation: Just for the sake of experiment I tried to emulate the previous behavior by changing geom_vfs.c, function g_vfs_open(), so that the relevant code reads as follows: if (vn_isdisk(vp, NULL)) bo->bo_bsize = DEV_BSIZE; else bo->bo_bsize = pp->sectorsize; It didn't break anything for me and it re-enabled current (CVS) fs/udf code to work as before. (patch from kern/78987 still has to be applied for large directories to be handled). So udf (on DVD) is fixed, ufs (on HDD), msdosfs (on HDD) and cd9660 (on CD) are not broken. Of course, the change should have affected only filesystems on CD/DVD (where sector/block size is not 512 bytes). Of course, unlike Bruce I don't use msdosfs on CD/DVD media (e.g. DVD-RAM), so my test is somewhat incomplete. -- Andriy Gapon
Re: fs/udf: vm pages "overlap" while reading large dir
on 06/02/2008 18:29 Andriy Gapon said the following: > Small summary of the above long description. > For directory reading fs/udf performs bread() on a (underlying) device > vnode. It passes block number as if block size was 512 bytes (i.e. > byte_offset_within_dev/512). On the other hand vm page index calculation > code uses the following formula: (blkno*bo_bsize)/PAGE_SIZE. > For CD/DVD devices bo_bsize is set to 2048. This results in page index > overlap when reading sufficiently many blocks from adjacent starting blocks. > That is, it seems that in "normal" cases page index gets calculated as > byte_offset/PAGE_SIZE, but in this case page index gets to be > 4*byte_offset/PAGE_SIZE. More details are in the quoted above text. > > With a lot of help from Bruce Evans I found this commit which seems to > have made a big difference: > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.458;r2=1.459;f=h > > Before this change page index for device vnodes was always > (blkno*512)/PAGE_SIZE. This is because of the vn_isdisk() case. > > udf_mountfs device vnode is passed to g_vfs_open() (in geom_vfs.c) and > the latter has the following line of code: > bo->bo_bsize = pp->sectorsize; > Where pp is geom provider for the device in question. > > I am not sure if the mentioned above vfs_bio.c change broke something > else in reading from device vnodes or if it fixed something for that. I > am also not sure what would be the consequences of setting bo_bsize to > 512 for vnodes of "disk" devices. It would most probably fix current > code in UDF, but I am not sure if it might break anything else. > > Just wanted to let the more knowledgeable people know and make a decision. > > P.S. bo_bsize seems to be referenced only in a handful of files and in > most of them it seems that it is related to "file" vnodes (as opposed to > "device" vnodes). > Paul, do you have any comment or opinion on the above? 
[sorry if clamped too much of the previous context, but it's all in archives] And a little continuation: Just for the sake of experiment I tried to emulate the previous behavior by changing geom_vfs.c, function g_vfs_open(), so that the relevant code reads as follows: if (vn_isdisk(vp, NULL)) bo->bo_bsize = DEV_BSIZE; else bo->bo_bsize = pp->sectorsize; It didn't break anything for me and it re-enabled current (CVS) fs/udf code to work as before. (patch from kern/78987 still has to be applied for large directories to be handled). So udf (on DVD) is fixed, ufs (on HDD), msdosfs (on HDD) and cd9660 (on CD) are not broken. Of course, the change should have affected only filesystems on CD/DVD (where sector/block size is not 512 bytes). Of course, unlike Bruce I don't use msdosfs on CD/DVD media (e.g. DVD-RAM), so my test is somewhat incomplete. -- Andriy Gapon
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
on 06/02/2008 18:34 Andriy Gapon said the following: > > Actually the patch is not entirely correct. max_size returned from > udf_bmap_internal should be used to calculate the number of contiguous > sectors for read-ahead (as opposed to file size in the patch). > Attached is an updated patch. The most prominent changes from the previous version: 1. udf_read can handle files with data embedded into fentry (sysutils/udfclient can produce such things for small files). 2. the above-mentioned files can now be mmap-ed, so that you can not only cat them, but also cp them; btw, I believe that cp(1) should have logic to try to fall back to read if mmap() fails - remember kern/92040? :-) 3. udf_bmap uses max_size[*] returned by udf_bmap_internal to calculate a number of contiguous blocks for read-ahead; [*] - this is a number of contiguous bytes from a given offset within a file. Things that stay the same: 1. I still use bread() via directory vnode; I think it's better even if reading via device vnode would work correctly. 2. I still try to read as much data as possible in directory bread(), I think this is a sort of an ad-hoc read-ahead without all the heuristics and logic that cluster reading does; I think that this should be ok because we do not have random reading of a directory, it's always sequential - either to read in the whole directory or linearly search through it until the wanted entry is found. Detailed description of the patch. udf_vnops.c: Hunk 1 - this is a "just in case" patch, reject files with unsupported strategy right away instead of deferring failure to udf_bmap_internal. Hunk 2 - fix typo in existing macro, add new macro to check if a file has data embedded into fentry ("unusual file"). Hunk 3 - for an unusual file just uiomove data from fentry. Hunk 4 - cosmetic changes plus a fix for udf_bmap_internal errors not actually being honored; also added an additional printf, because udf_strategy should never really be called for unusual files. 
Hunk 5 - return EOPNOTSUPP for unusual files, this is correct because we can not correctly bmap them and also this enables correct handling of these files in vm code (vnode_pager); also code for read-ahead calculation borrowed from cd9660. Hunk 6 - explain function of udf_bmap_internal call in this place. Hunk 7 - some cosmetics; prevent size passed to bread from being greater than MAXBSIZE; do bread via directory vnode rather than device vnode (udf_readlblks was a macro-wrapper around bread). udf_vfsops.c: Hunk 1 - this is borrowed from cd9660, apparently this data is needed for correct cluster reading. A couple of words about testing. udf in 7.0-RC1 plus this patch correctly handled everything I could throw at it - huge directories, unusual files, reading, mmaping. Using udfclientfs I wrote a directory on a DVD-RAM UDF disk that contained 2G of files from ports distfiles. The files varied in size from about 100 bytes to hundreds of megabytes. I watched systat -vmstat while copying this directory. With unpatched code some files would not be copied (small "unusual files"), KB/t value stayed at 2K (meaning reads always were sector sized), MB/s was about 1. With the patch all files were copied successfully and correctly (md5 verified), KB/t varied from 2K to 64K (apparently depending on size of currently copied file), MB/s varied from 1 to 4 which is not bad for this DVD-RAM disk and drive. If somebody asks I can produce a UDF image that would contain various test cases: large directory, unusual files, etc. Or you can give a try to udfclient/udfclientfs from the sysutils/udfclient port. I hope this patch will be useful. -- Andriy Gapon --- udf_vnops.c.orig 2008-01-29 23:50:49.0 +0200 +++ udf_vnops.c 2008-02-07 00:12:34.0 +0200 @@ -168,6 +168,16 @@ udf_open(struct vop_open_args *ap) { struct udf_node *np = VTON(ap->a_vp); off_t fsize; + /* + * We currently do not support any other strategy type, so + * udf_bmap_internal, udf_bmap, udf_strategy would fail. + * I.e. 
we can not access content of this file anyway. + */ + if (le16toh(np->fentry->icbtag.strat_type) != 4) { + printf("Unsupported strategy type %d\n", le16toh(np->fentry->icbtag.strat_type)); + return ENODEV; + } + fsize = le64toh(np->fentry->inf_len); vnode_create_vobject(ap->a_vp, fsize, ap->a_td); return 0; @@ -340,7 +350,9 @@ udf_pathconf(struct vop_pathconf_args *a #define lblkno(udfmp, loc) ((loc) >> (udfmp)->bshift) #define blkoff(udfmp, loc) ((loc) & (udfmp)->bmask) -#define lblktosize(imp, blk) ((blk) << (udfmp)->bshift) +#define lblktosize(udfmp, blk) ((blk) << (udfmp)->bshift) + +#define HAS_EMBEDDED_DATA(node) ((le16toh((node)->fentry->icbtag.flags) & 0x7) == 3) static int udf_read(struct vop_read_args *ap) @@ -359,6 +371,26 @@ udf_read(struct vop_read_args *ap) return (0); if (uio->uio_offset < 0) return (EINVAL); + + if (HAS_EMBEDDED_DATA(node)) { + struct file_entry *fentry; + uint8_t *data; + + fentry = node->fentry; + data = &fentry->data[le32toh(fentr
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
on 05/02/2008 22:43 Scott Long said the following: > Andriy Gapon wrote: > >> But there is another issue that I also mentioned in the email about >> directory reading. It is UDF_INVALID_BMAP case of udf_bmap_internal, >> i.e. the case when file data is embedded into a file entry. >> This is a special case that needs to be handled differently. >> udf_readatoffset() handles it, but the latest udf_read code doesn't. >> I have a real UDF filesystem where this type of allocation is used for >> small files and those files can not be read. > > Oh, so directory data can also follow this convention? Blah. Feel free > to fix that too if you want =-) Yes, this allocation type can be applied to both files and directories. Directories are already handled because of readatoffset() but udf_read needs to be made aware of this special case. -- Andriy Gapon
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
on 05/02/2008 20:16 Pav Lucistnik said the following: > Andriy Gapon píše v út 05. 02. 2008 v 16:40 +0200: > >>> Yay, and can you fix the sequential read performance while you're at it? >>> Kthx! > >> this was almost trivial :-) >> See the attached patch, first hunk is just for consistency. >> The code was borrowed from cd9660, only field/variable names are adjusted. > > Omg looks good. Can't wait for it to bubble through to 6-STABLE :) > > Thanks a million for working on this. > Actually the patch is not entirely correct. max_size returned from udf_bmap_internal should be used to calculate the number of contiguous sectors for read-ahead (as opposed to file size in the patch). -- Andriy Gapon
Re: fs/udf: vm pages "overlap" while reading large dir
on 04/02/2008 17:55 Scott Long said the following: > Andriy Gapon wrote: [snip] >> After some code reading and run-time debugging, here are some facts >> about udf directory reading: >> 1. bread-ing is done via device vnode (as opposed to directory vnodes), >> as a consequence udf_strategy is not involved in directory reading >> 2. the device vnode has a non-NULL v_object, this means that VMIO is used [strike out incorrect guess] >> 4. bread is done on as much data as possible, up to MAXBSIZE [and even >> slightly more in some cases] (i.e. code determines directory data size >> and attempts to read in everything in one go) >> 5. physical sector number adjusted to DEV_BSIZE (512 bytes) sectors is >> passed to bread - this is because of #1 above and peculiarities of buf >> system >> 6. directory reading and file reading are quite different, because the >> latter does reading via file vnode >> >> Let's consider a simple case of "directory reading" 3 * PAGE_SIZE (4K) >> bytes starting from a physical sector N with N%2 = 0 from media with >> sector size of 2K (most common CD/DVD case): >> 1. we want to read 12 KB = 3 pages = 6 sectors starting from sector N, >> N%2 = 0 >> 2. 4*N is passed as a sector number to bread >> 3. bo_bsize of the corresponding bufobj is a media sector size, i.e. 2K >> 4. actual read in bread will happen from b_iooffset of 4*N*DEV_BSIZE = >> N*2K, i.e. correct offset within the device >> 5. for a fresh read getblk will be called >> 6. getblk will set b_offset to blkno*bo_bsize=4*N*2K, i.e. 4 times the >> correct byte offset within the device >> 7. allocbuf will allocate pages using B_VMIO case >> 8. 
allocbuf calculates base page index as follows: pi = b_offset/PAGE_SIZE >> this means the following: >> sectors N and N+1 will be bound to a page with index 4*N*2K/4K = 2*N >> sectors N+2 and N+3 will be bound to the next page 2*N +1 >> sectors N+4 and N+5 is tied to the next page 2*N + 2 [insert] note that bstrategy will get correct params as b_iooffset is set to blkno*DEV_BSIZE, i.e. correct byte offset. >> >> Now let's consider a "directory read" of 2 sectors (1 page) starting at >> physical sector N+1. >> Using the same calculations as above, we see that the sector will be >> tied to a page 2*(N+1) = 2*N + 2. >> And this page already contains valid cached data from the previous read, >> *but* it contains data from sectors N+4 and N+5 instead of N+1 and N+2. >> >> So, we see, that because of b_offset being 4 times the actual byte >> offset, we get incorrect page indexing. Namely, sector X gets associated >> with different pages depending on what sector is used as a starting >> sector for bread. If bread starts at sector N, then data of sector N+1 >> is associated with (second half of) page with index 2*N; but if bread >> starts at sector N+1 then it gets associated with (the first half of) >> page with index 2*(N+1). [snip] > I think you forgot to attach the patch =-) This is an excellent > write-up, and I think you're on the right track. It's been nearly 8 > years since I wrote most of this code, so my memory is a bit fuzzy, but > I think the reason that I used the device vnode is because I was having > trouble with overlapping buffers when using the directory vnode, so this > was the easiest way to avoid that at the time. Since it sounds like > this is no longer a viable solution, I definitely support your ideas. [additional analysis of the problem, with new facts that might give rise to a different solution] Small summary of the above long description. For directory reading fs/udf performs bread() on a (underlying) device vnode. 
It passes block number as if block size was 512 bytes (i.e. byte_offset_within_dev/512). On the other hand vm page index calculation code uses the following formula: (blkno*bo_bsize)/PAGE_SIZE. For CD/DVD devices bo_bsize is set to 2048. This results in page index overlap when reading sufficiently many blocks from adjacent starting blocks. That is, it seems that in "normal" cases page index gets calculated as byte_offset/PAGE_SIZE, but in this case page index gets to be 4*byte_offset/PAGE_SIZE. More details are in the quoted above text. With a lot of help from Bruce Evans I found this commit which seems to have made a big difference: http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.458;r2=1.459;f=h Before this change page index for device vnodes was always (blkno*512)/PAGE_SIZE. This is because of the vn_isdisk() case. udf_mountfs device vnode is passed to g_vfs_open() (in geom_vfs.c) and the latter has the following line of code: bo->bo_bsize = pp->sectorsize; Where pp is geom provider for the device in question. I am not sure if the mentioned above vfs_bio.c change broke something else in reading from device vnodes or if it fixed something for that. I am also not sure what would be the consequences of setting bo_bsize to 512 for vnodes of "disk" devices. It would most probably fix current code in UDF, but I am not sure if i
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
Andriy Gapon wrote: on 04/02/2008 22:07 Pav Lucistnik said the following: Julian Elischer píše v po 04. 02. 2008 v 10:36 -0800: Andriy Gapon wrote: More on the problem with reading big directories on UDF. You do realise that you have now made yourself the official maintainer of the UDF file system by submitting a competent and insightful analysis of the problem? Yay, and can you fix the sequential read performance while you're at it? Kthx! Pav, this was almost trivial :-) Ok that settles it.. it's yours. See the attached patch, first hunk is just for consistency. The code was borrowed from cd9660, only field/variable names are adjusted. But there is another issue that I also mentioned in the email about directory reading. It is UDF_INVALID_BMAP case of udf_bmap_internal, i.e. the case when file data is embedded into a file entry. This is a special case that needs to be handled differently. udf_readatoffset() handles it, but the latest udf_read code doesn't. I have a real UDF filesystem where this type of allocation is used for small files and those files can not be read. This is described in Part 4, section 14.6.8 of ECMA-167.
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
Andriy Gapon wrote: on 04/02/2008 22:07 Pav Lucistnik said the following: Julian Elischer píše v po 04. 02. 2008 v 10:36 -0800: Andriy Gapon wrote: More on the problem with reading big directories on UDF. You do realise that you have now made yourself the official maintainer of the UDF file system by submitting a competent and insightful analysis of the problem? Yay, and can you fix the sequential read performance while you're at it? Kthx! Pav, this was almost trivial :-) See the attached patch, first hunk is just for consistency. The code was borrowed from cd9660, only field/variable names are adjusted. Your patch looks reasonable. Btw, for the same reason that read-ahead makes file reading much faster, I would not change directory reading to be 1 sector at a time (unless you also do read-ahead for it). But there is another issue that I also mentioned in the email about directory reading. It is UDF_INVALID_BMAP case of udf_bmap_internal, i.e. the case when file data is embedded into a file entry. This is a special case that needs to be handled differently. udf_readatoffset() handles it, but the latest udf_read code doesn't. I have a real UDF filesystem where this type of allocation is used for small files and those files can not be read. Oh, so directory data can also follow this convention? Blah. Feel free to fix that too if you want =-) Scott
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
Andriy Gapon píše v út 05. 02. 2008 v 16:40 +0200: > > Yay, and can you fix the sequential read performance while you're at it? > > Kthx! > this was almost trivial :-) > See the attached patch, first hunk is just for consistency. > The code was borrowed from cd9660, only field/variable names are adjusted. Omg looks good. Can't wait for it to bubble through to 6-STABLE :) Thanks a million for working on this. -- Pav Lucistnik <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> What is the airspeed velocity of an unladen swallow? signature.asc Description: This is a digitally signed message part
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
on 04/02/2008 20:36 Julian Elischer said the following: > Andriy Gapon wrote: >> More on the problem with reading big directories on UDF. > > You do realise that you have now made yourself the official > maintainer of the UDF file system by submitting a competent > and insightful analysis of the problem? I feel like I have to reply to this :-) First of all, thank you. But unfortunately my knowledge of the core kernel is quite limited, I pick up some bits here and there by reading the code, but this is too little and sometimes my deductions are wrong. Another thing is very limited time. I do like fixing code, but I rarely have time, and when I do, it must be something very interesting and important to me. This time it is interesting :-) Because I actually use UDF quite a lot, I use DVD-RAM media for various archival and data transfer purposes, and sometimes there is some unusual case. I also try to keep up with FreeBSD development as much as possible, so sometimes there are new issues after upgrades. So, thank you again, but please think of me as an occasional patch submitter. -- Andriy Gapon
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
on 04/02/2008 22:07 Pav Lucistnik said the following: > Julian Elischer píše v po 04. 02. 2008 v 10:36 -0800: >> Andriy Gapon wrote: >>> More on the problem with reading big directories on UDF. >> You do realise that you have now made yourself the official >> maintainer of the UDF file system by submitting a competent >> and insightful analysis of the problem? > > Yay, and can you fix the sequential read performance while you're at it? > Kthx! > Pav, this was almost trivial :-) See the attached patch, first hunk is just for consistency. The code was borrowed from cd9660, only field/variable names are adjusted. But there is another issue that I also mentioned in the email about directory reading. It is UDF_INVALID_BMAP case of udf_bmap_internal, i.e. the case when file data is embedded into a file entry. This is a special case that needs to be handled differently. udf_readatoffset() handles it, but the latest udf_read code doesn't. I have a real UDF filesystem where this type of allocation is used for small files and those files can not be read. This is described in Part 4, section 14.6.8 of ECMA-167. -- Andriy Gapon --- udf_vnops.c.orig 2008-01-29 23:50:49.0 +0200 +++ udf_vnops.c 2008-02-05 01:30:23.0 +0200 @@ -851,7 +846,7 @@ udf_bmap(struct vop_bmap_args *a) if (a->a_runb) *a->a_runb = 0; - error = udf_bmap_internal(node, a->a_bn * node->udfmp->bsize, &lsector, + error = udf_bmap_internal(node, a->a_bn << node->udfmp->bshift, &lsector, &max_size); if (error) return (error); @@ -859,9 +854,27 @@ udf_bmap(struct vop_bmap_args *a) /* Translate logical to physical sector number */ *a->a_bnp = lsector << (node->udfmp->bshift - DEV_BSHIFT); - /* Punt on read-ahead for now */ - if (a->a_runp) - *a->a_runp = 0; + /* + * Determine maximum number of readahead blocks following the + * requested block. 
+ */ + if (a->a_runp) { + off_t fsize; + int nblk; + + fsize = le64toh(node->fentry->inf_len); + nblk = (fsize >> node->udfmp->bshift) - (a->a_bn + 1); + if (nblk <= 0) + *a->a_runp = 0; + else if (nblk >= (MAXBSIZE >> node->udfmp->bshift)) + *a->a_runp = (MAXBSIZE >> node->udfmp->bshift) - 1; + else + *a->a_runp = nblk; + } + + if (a->a_runb) { + *a->a_runb = 0; + } return (0); } --- udf_vfsops.c.orig 2007-03-13 03:50:24.0 +0200 +++ udf_vfsops.c 2008-02-05 01:29:10.0 +0200 @@ -330,6 +330,11 @@ udf_mountfs(struct vnode *devvp, struct bo = &devvp->v_bufobj; + if (devvp->v_rdev->si_iosize_max != 0) + mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max; + if (mp->mnt_iosize_max > MAXPHYS) + mp->mnt_iosize_max = MAXPHYS; + /* XXX: should be M_WAITOK */ MALLOC(udfmp, struct udf_mnt *, sizeof(struct udf_mnt), M_UDFMOUNT, M_NOWAIT | M_ZERO); ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
Julian Elischer píše v po 04. 02. 2008 v 10:36 -0800: > Andriy Gapon wrote: > > More on the problem with reading big directories on UDF. > > You do realise that you have now made yourself the official > maintainer of the UDF file system by submitting a competent > and insightful analysis of the problem? Yay, and can you fix the sequential read performance while you're at it? Kthx! -- Pav Lucistnik <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> Squish. Larger than the normal icky things, and twice as icky.
Re: fs/udf: vm pages "overlap" while reading large dir [patch]
Andriy Gapon wrote: More on the problem with reading big directories on UDF. You do realise that you have now made yourself the official maintainer of the UDF file system by submitting a competent and insightful analysis of the problem? First, some sleuthing. I came to believe that the problem is caused by some larger change in vfs/vm/buf area. It seems that now VMIO is applied to more vnode types than before. In particular it seems that now vnodes for devices have non-NULL v_object (or maybe this is about directory vnodes, I am not sure). Also it seems that the problem should happen for any directory with size larger than four 2048-bytes sectors (I think that any directory with > 300 files would definitely qualify). After some code reading and run-time debugging, here are some facts about udf directory reading: 1. bread-ing is done via device vnode (as opposed to directory vnodes), as a consequence udf_strategy is not involved in directory reading 2. the device vnode has a non-NULL v_object, this means that VMIO is used 3. it seems that the code assumed that non-VM buffers are used 4. bread is done on as much data as possible, up to MAXBSIZE [and even slightly more in some cases] (i.e. code determines directory data size and attempts to read in everything in one go) 5. physical sector number adjusted to DEV_BSIZE (512 bytes) sectors is passed to bread - this is because of #1 above and peculiarities of buf system 6. directory reading and file reading are quite different, because the latter does reading via file vnode Let's consider a simple case of "directory reading" 3 * PAGE_SIZE (4K) bytes starting from a physical sector N with N%2 = 0 from media with sector size of 2K (most common CD/DVD case): 1. we want to read 12 KB = 3 pages = 6 sectors starting from sector N, N%2 = 0 2. 4*N is passed as a sector number to bread 3. bo_bsize of the corresponding bufobj is a media sector size, i.e. 2K 4. 
actual read in bread will happen from b_iooffset of 4*N*DEV_BSIZE = N*2K, i.e. correct offset within the device 5. for a fresh read getblk will be called 6. getblk will set b_offset to blkno*bo_bsize=4*N*2K, i.e. 4 times the correct byte offset within the device 7. allocbuf will allocate pages using B_VMIO case 8. allocbuf calculates base page index as follows: pi = b_offset/PAGE_SIZE this means the following: sectors N and N+1 will be bound to a page with index 4*N*2K/4K = 2*N sectors N+2 and N+3 will be bound to the next page 2*N +1 sectors N+4 and N+5 is tied to the next page 2*N + 2 Now let's consider a "directory read" of 2 sectors (1 page) starting at physical sector N+1. Using the same calculations as above, we see that the sector will be tied to a page 2*(N+1) = 2*N + 2. And this page already contains valid cached data from the previous read, *but* it contains data from sectors N+4 and N+5 instead of N+1 and N+2. So, we see, that because of b_offset being 4 times the actual byte offset, we get incorrect page indexing. Namely, sector X gets associated with different pages depending on what sector is used as a starting sector for bread. If bread starts at sector N, then data of sector N+1 is associated with (second half of) page with index 2*N; but if bread starts at sector N+1 then it gets associated with (the first half of) page with index 2*(N+1). Previously, before VMIO change, data for all reads was put into separate buffers that did not share anything between them. So the problem was limited only to wasting memory with duplicate data, but no actual over-runs did happen. Now we have the over-runs because the VM pages are shared between the buffers of the same vnode. One obvious solution is to limit bread size to 2*PAGE_SIZE = 4 * sector_size. In this case, as before, we would waste some memory on duplicate data but we would avoid page overruns. If we limit bread size even more, to 1 sector, then we would not have any duplicate data at all. 
But there would still be some resource waste - each page would correspond to one sector, so a 4K page would have only 2K of valid data and the other half in each page is unused. Another solution, which to me seems to be better, is to do "usual" reading - pass a directory vnode and 2048-byte sector offset to bread. In this case udf_strategy will get called for blkno translation, all offsets and indexes will be correct and everything will work perfectly as it is the case for all other filesystems. The only overhead in this case comes from the need to handle UDF_INVALID_BMAP case (where data is embedded into file entry). So it means that we have to call bmap_internal to see if we have that special case and handle it, and if the case is not special we would call bread on a directory vnode which means that bmap_internal would be called again in udf_strategy. One optimization that can be done in this case is to create a lighter version of bmap_internal that would merely check for the special case and wouldn't do anything else. I am attaching a patch based on the
Re: fs/udf: vm pages "overlap" while reading large dir
Andriy Gapon wrote: More on the problem with reading big directories on UDF. First, some sleuthing. I came to believe that the problem is caused by some larger change in vfs/vm/buf area. It seems that now VMIO is applied to more vnode types than before. In particular it seems that now vnodes for devices have non-NULL v_object (or maybe this is about directory vnodes, I am not sure). Also it seems that the problem should happen for any directory with size larger than four 2048-bytes sectors (I think that any directory with > 300 files would definitely qualify). After some code reading and run-time debugging, here are some facts about udf directory reading: 1. bread-ing is done via device vnode (as opposed to directory vnodes), as a consequence udf_strategy is not involved in directory reading 2. the device vnode has a non-NULL v_object, this means that VMIO is used 3. it seems that the code assumed that non-VM buffers are used 4. bread is done on as much data as possible, up to MAXBSIZE [and even slightly more in some cases] (i.e. code determines directory data size and attempts to read in everything in one go) 5. physical sector number adjusted to DEV_BSIZE (512 bytes) sectors is passed to bread - this is because of #1 above and peculiarities of buf system 6. directory reading and file reading are quite different, because the latter does reading via file vnode Let's consider a simple case of "directory reading" 3 * PAGE_SIZE (4K) bytes starting from a physical sector N with N%2 = 0 from media with sector size of 2K (most common CD/DVD case): 1. we want to read 12 KB = 3 pages = 6 sectors starting from sector N, N%2 = 0 2. 4*N is passed as a sector number to bread 3. bo_bsize of the corresponding bufobj is a media sector size, i.e. 2K 4. actual read in bread will happen from b_iooffset of 4*N*DEV_BSIZE = N*2K, i.e. correct offset within the device 5. for a fresh read getblk will be called 6. getblk will set b_offset to blkno*bo_bsize=4*N*2K, i.e. 
4 times the correct byte offset within the device 7. allocbuf will allocate pages using B_VMIO case 8. allocbuf calculates base page index as follows: pi = b_offset/PAGE_SIZE this means the following: sectors N and N+1 will be bound to a page with index 4*N*2K/4K = 2*N sectors N+2 and N+3 will be bound to the next page 2*N +1 sectors N+4 and N+5 is tied to the next page 2*N + 2 Now let's consider a "directory read" of 2 sectors (1 page) starting at physical sector N+1. Using the same calculations as above, we see that the sector will be tied to a page 2*(N+1) = 2*N + 2. And this page already contains valid cached data from the previous read, *but* it contains data from sectors N+4 and N+5 instead of N+1 and N+2. So, we see, that because of b_offset being 4 times the actual byte offset, we get incorrect page indexing. Namely, sector X gets associated with different pages depending on what sector is used as a starting sector for bread. If bread starts at sector N, then data of sector N+1 is associated with (second half of) page with index 2*N; but if bread starts at sector N+1 then it gets associated with (the first half of) page with index 2*(N+1). Previously, before VMIO change, data for all reads was put into separate buffers that did not share anything between them. So the problem was limited only to wasting memory with duplicate data, but no actual over-runs did happen. Now we have the over-runs because the VM pages are shared between the buffers of the same vnode. One obvious solution is to limit bread size to 2*PAGE_SIZE = 4 * sector_size. In this case, as before, we would waste some memory on duplicate data but we would avoid page overruns. If we limit bread size even more, to 1 sector, then we would not have any duplicate data at all. But there would still be some resource waste - each page would correspond to one sector, so 4K page would have only 2K of valid data and the other half in each page is unused. 
Another solution, which seems better to me, is to do "usual" reading: pass the directory vnode and a 2048-byte sector offset to bread. In this case udf_strategy will get called for blkno translation, all offsets and indexes will be correct, and everything will work as it does for all other filesystems. The only overhead comes from the need to handle the UDF_INVALID_BMAP case (where the data is embedded in the file entry). It means that we have to call bmap_internal to see if we have that special case and handle it; if the case is not special, we call bread on the directory vnode, which means that bmap_internal will be called again in udf_strategy. One optimization that can be done here is to create a lighter version of bmap_internal that would merely check for the special case and wouldn't do anything else.

I am attaching a patch based on the ideas above. It fixes the problem for me and doesn't seem to create any new ones :-)

About the patch: hunk #1 fixes a copy/paste; hunk #2 fixes strategy to not set junk b_blkno i