Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-28 Thread Scott Long

Andriy Gapon wrote:

on 26/02/2008 21:23 Pav Lucistnik said the following:

Pav Lucistnik wrote on Tue, 05. 02. 2008 at 19:16 +0100:

Andriy Gapon wrote on Tue, 05. 02. 2008 at 16:40 +0200:


Yay, and can you fix the sequential read performance while you're at it?
Kthx!

this was almost trivial :-)
See the attached patch, first hunk is just for consistency.
The code was borrowed from cd9660, only field/variable names are adjusted.

Just tested it with my shiny new Blu-ray drive, and it works wonders.
Finally, seamless playback of media files off UDF-carrying media.

So, how does it look WRT committing it?



Pav,

thank you for the feedback/reminder.
In my personal opinion, the latest patch posted and described at the
following link is a very good candidate for commit:
http://docs.FreeBSD.org/cgi/mid.cgi?47AA43B9.1040608
It might have a couple of style-related issues, but it should improve
things a bit even if some larger VM/VFS/GEOM issues remain.

And by the way, a patch from the following PR would be a good
"side-dish" for the above patch:
http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/78987
I think it is simple and obvious enough.



I will commit both of these to CVS today.  Thanks again.

Scott



Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-28 Thread Scott Long

Pav Lucistnik wrote:

Andriy Gapon wrote on Thu, 28. 02. 2008 at 10:33 +0200:

And while I have your attention, I have a related question.

I have produced a bunch of ISO9660 Level 3 / UDF hybrid media with
mkisofs, and when I mount the UDF part of them, the mount point (root
directory of the media) has 0x000 permissions. Yes, that's right, d---------
in ls -l. That makes the whole volume inaccessible for everyone except
root.

Is this something you can mend in our UDF driver, or should I go dig
inside mkisofs guts? Windows handles these media without any visible
problems.



I wonder if Windows even observes the permissions bits.  You'd have to
special-case the UDF driver code in FreeBSD, which is certainly possible
but not terribly attractive.  I'd be interested to see what exactly
mkisofs is doing.  Maybe it's putting permissions into extended attributes
and assuming the filesystem driver will read those instead.

Scott


Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-28 Thread Andriy Gapon
on 28/02/2008 11:59 Pav Lucistnik said the following:
> Andriy Gapon wrote on Thu, 28. 02. 2008 at 10:33 +0200:
> 
> And while I have your attention, I have a related question.
> 
> I have produced a bunch of ISO9660 Level 3 / UDF hybrid media with
> mkisofs, and when I mount the UDF part of them, the mount point (root
> directory of the media) has 0x000 permissions. Yes, that's right, d---------
> in ls -l. That makes the whole volume inaccessible for everyone except
> root.
> 
> Is this something you can mend in our UDF driver, or should I go dig
> inside mkisofs guts? Windows handles these media without any visible
> problems.
> 

I think this ought to be something in mkisofs; last I checked, our code
was fully conforming to the specs in this regard.
And obviously it works with windoz, because it simply ignores the
permissions :-)
We might consider adding a fixup for such broken media (if my assessment
is correct), but I'd prefer that it be fixed at the source.

-- 
Andriy Gapon


Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-28 Thread Pav Lucistnik
Andriy Gapon wrote on Thu, 28. 02. 2008 at 10:33 +0200:

And while I have your attention, I have a related question.

I have produced a bunch of ISO9660 Level 3 / UDF hybrid media with
mkisofs, and when I mount the UDF part of them, the mount point (root
directory of the media) has 0x000 permissions. Yes, that's right, d---------
in ls -l. That makes the whole volume inaccessible for everyone except
root.

Is this something you can mend in our UDF driver, or should I go dig
inside mkisofs guts? Windows handles these media without any visible
problems.

-- 
Pav Lucistnik <[EMAIL PROTECTED]>
  <[EMAIL PROTECTED]>

How do you tell an experienced hacker from a beginner?
A beginner thinks that a kilobyte has 1000 bytes.
An experienced hacker thinks that a kilometer has 1024 meters.




Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-28 Thread Andriy Gapon
on 26/02/2008 21:23 Pav Lucistnik said the following:
> Pav Lucistnik wrote on Tue, 05. 02. 2008 at 19:16 +0100:
>> Andriy Gapon wrote on Tue, 05. 02. 2008 at 16:40 +0200:
>>
 Yay, and can you fix the sequential read performance while you're at it?
 Kthx!
>>> this was almost trivial :-)
>>> See the attached patch, first hunk is just for consistency.
>>> The code was borrowed from cd9660, only field/variable names are adjusted.
> 
> Just tested it with my shiny new Blu-ray drive, and it works wonders.
> Finally, seamless playback of media files off UDF-carrying media.
> 
> So, how does it look WRT committing it?
> 

Pav,

thank you for the feedback/reminder.
In my personal opinion, the latest patch posted and described at the
following link is a very good candidate for commit:
http://docs.FreeBSD.org/cgi/mid.cgi?47AA43B9.1040608
It might have a couple of style-related issues, but it should improve
things a bit even if some larger VM/VFS/GEOM issues remain.

And by the way, a patch from the following PR would be a good
"side-dish" for the above patch:
http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/78987
I think it is simple and obvious enough.

-- 
Andriy Gapon


Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-26 Thread Pav Lucistnik
Pav Lucistnik wrote on Tue, 05. 02. 2008 at 19:16 +0100:
> Andriy Gapon wrote on Tue, 05. 02. 2008 at 16:40 +0200:
> 
> > > Yay, and can you fix the sequential read performance while you're at it?
> > > Kthx!
> 
> > this was almost trivial :-)
> > See the attached patch, first hunk is just for consistency.
> > The code was borrowed from cd9660, only field/variable names are adjusted.

Just tested it with my shiny new Blu-ray drive, and it works wonders.
Finally, seamless playback of media files off UDF-carrying media.

So, how does it look WRT committing it?

-- 
Pav Lucistnik <[EMAIL PROTECTED]>
  <[EMAIL PROTECTED]>
Somebody ought to cross ball point pens with coat hangers so that the
pens will multiply instead of disappear.




Re: fs/udf: vm pages "overlap" while reading large dir

2008-02-12 Thread Bruce Evans

On Tue, 12 Feb 2008, Andriy Gapon wrote:


on 12/02/2008 15:11 Bruce Evans said the following:

On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:


In message <[EMAIL PROTECTED]>, Andriy Gapon writes:



3.1. for a fresh buf getblk would assign the following:
bsize = bo->bo_bsize;
offset = blkno * bsize;

That's clearly wrong.


If units were always 512-blocks, then anything except bsize = DEV_BSIZE
would be clearly wrong.  Things aren't that simple (but probably should
be).  Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks.


Oh, I missed this obvious thing!


Me too.


Bruce, thank you for putting me back on the ground :-)
Even in the UDF case, when we bread() via a file or directory vnode we pass
a logical 2048-byte block number (within the file) to bread(). In this
case the code in getblk() does the right thing when it calculates offset
as blkno * 2048. Calculating it assuming 512-byte units would be a disaster.


I think the size is mnt_stat.f_iosize in general (but not for device vnodes).


And the actual reading works correctly because udf_strategy is called
for such vnodes and it translates block numbers from logical to physical
and also correctly re-calculates b_iooffset for actual reading. So
b_iooffset value set in breadn() (using dbtob) is completely ignored.


Is this setting (and the corresponding setting for writing) ever used?


I remember that this is why g_vfs_* was my primary target.
It seems that I am reverting to my earlier opinion: either g_vfs_open should be
smart about setting bo_bsize, or g_vfs_strategy should be smart about
calculating bio_offset, or individual filesystems should not depend on
g_vfs_* always doing the best thing for them.


In fact, ffs already overrides the setting of bo_bsize for the device
vnode to a different wrong setting -- g_vfs_open() sets the sector size,
and ffs_mount() changes the setting to fs_bsize, but ffs actually needs
the setting to be DEV_BSIZE (I think).  Other bugs from this:
- ffs_rawread wants the sector size, and it assumes that this is in bo_bsize
  for the device vnode, but ffs_mount() has changed this to fs_bsize.
- multiple r/o mounts are supposed to work, but don't, since there is only
  one device vnode with a shared bufobj, but the bufobj needs to be
  per-file-system since all mounts write to it.  Various bad things
  including panics result.  There is a well-known panic from bo_private
  becoming garbage on unmount.  I just noticed more possibilities for
  panics.  bo_ops points to static storage, so it never becomes complete
  garbage.  However, at least ffs sets it blindly early on in
  ffs_mountfs(), before looking at the file system to see if ffs can
  mount it.  Thus if the file system is already mounted by another
  ffs, then ffs clobbers the other fs's bo_ops.  The ffs mount will
  presumably fail, leaving bo_ops clobbered.  Also, a successful
  g_vfs_open() has clobbered bo_ops, bo_private and bo_bsize a little
  earlier.  g_vfs_open() is fundamental to looking at the file system,
  since I/O is not set up until it completes.  Clobbering the pointers
  is most dangerous, but just clobbering bo_bsize breaks blkno
  calculations for any code that uses bo_bsize.

Apart from these bugs, the fix for the blkno calculations for device
vp's may be as simple as setting bo_bsize to DEV_BSIZE for the device
vp of all disk file systems (since all disk file systems expect
this size).  Currently, ffs seems to be the only file system that
overrides g_vfs_open()'s default of the sector size.  Nothing in any
of fs/, gnu/fs/ and contrib/ references bo_bsize.


In any case, it seems that it is the g_vfs_* that is currently
inconsistent: it sets bo_bsize to a somewhat arbitrary value, but
expects that it will always be provided with a correct b_iooffset (the
latter being rigidly calculated via dbtob() in the buffcache code). So this
leaves filesystems dead in the water when reading via a device vnode:
provide blkno in 512-byte units and you screw up the VM cache; provide blkno in
bo_bsize units and you screw up reading from disk.


I think I/O on the disk doesn't use bo_bsize, but only the sector size
(to scale the byte count to a sector count).  Otherwise, ffs's override
would break I/O.  geom is another place that has few references to
bo_bsize -- it just clobbers it in g_vfs_open().


Not sure if the FS-s should have wits to set bo_bsize properly and/or
provide proper bo_ops - we are talking about a device vnode here.
Should filesystems be involved in the business of setting its
properties? Not sure.


Yes.  They can set it more easily as they can tell g_vfs_open() what to
set it to.  Except, until the bugs are fixed properly, g_vfs_open() can
more easily do tests to prevent clobbering.  For bo_bsize and bo_ops,
sharing a common value is safe and safe changes can be detected by
setting to a special value on last unmount.  For bo_private, a safety
check would probably disallow multiple mounts (since cp is dynamically
allocated (?)).
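
A purely illustrative userland sketch of the "special value on last unmount"
safety check described here; the struct and names below are simplified,
hypothetical stand-ins, not the kernel's bufobj or g_vfs_open() code:

#include <stdio.h>

struct fake_bufobj {
	const void *bo_ops;		/* stand-in for struct buf_ops * */
};

static const int ops_free_sentinel = 0;	/* "nobody owns this bufobj" marker */

/* Refuse to clobber a field that another mount still owns. */
static int
claim_bufobj(struct fake_bufobj *bo, const void *newops)
{
	if (bo->bo_ops != &ops_free_sentinel && bo->bo_ops != newops) {
		printf("bufobj already claimed, refusing to clobber\n");
		return (-1);
	}
	bo->bo_ops = newops;
	return (0);
}

int
main(void)
{
	static const int ffs_ops = 0, other_ops = 0;	/* stand-ins for ops tables */
	struct fake_bufobj bo = { .bo_ops = &ops_free_sentinel };

	claim_bufobj(&bo, &ffs_ops);		/* first mount: succeeds */
	claim_bufobj(&bo, &other_ops);		/* second mount: refused */
	bo.bo_ops = &ops_free_sentinel;		/* last unmount restores the sentinel */
	return (0);
}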

Bruce

Re: fs/udf: vm pages "overlap" while reading large dir

2008-02-12 Thread Andriy Gapon
on 12/02/2008 17:58 Bruce Evans said the following:
> On Tue, 12 Feb 2008, Andriy Gapon wrote:
>> And the actual reading works correctly because udf_strategy is called
>> for such vnodes and it translates block numbers from logical to physical
>> and also correctly re-calculates b_iooffset for actual reading. So
>> b_iooffset value set in breadn() (using dbtob) is completely ignored.
> 
> Is this setting (and the corresponding setting for writing) ever used?

Bruce,

replying only to this part (digesting the others): yes, it is used by
g_vfs_strategy, which becomes the bufobj strategy after g_vfs_open. It passes
b_iooffset as-is to bio_offset.

P.S. of course, I am speaking about the current sources - I know you
have almost an OS of your own, kidding :-)

-- 
Andriy Gapon


Re: fs/udf: vm pages "overlap" while reading large dir

2008-02-12 Thread Andriy Gapon
on 12/02/2008 15:11 Bruce Evans said the following:
> On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:
> 
>> In message <[EMAIL PROTECTED]>, Andriy Gapon writes:
>>
>>> 2.3. this code passes to bread blkno that is calculated as 4*sector,
>>> where sector is a number of a physical 2048-byte sector. [**]
>>> [**] - I think that this is a requirement of buffcache system, because
>>> internally it performs many calculations that seem to assume that block
>>> size is always 512.
>> Yes, the buf-cache works in 512-byte units throughout.
> 
> I missed the dbtob() conversions in vfs_bio.c when I replied previously.
> So blkno cannot be a cookie; it must be for a 512-block.  So how did
> the cases where bsize != DEV_BSIZE in getblk() ever work?

It seems that this is because specific VOP_STRATEGY and/or BO_STRATEGY
kicked in and did the right thing, see below.

>>> 3.1. for a fresh buf getblk would assign the following:
>>> bsize = bo->bo_bsize;
>>> offset = blkno * bsize;
>> That's clearly wrong.
> 
> If units were always 512-blocks, then anything except bsize = DEV_BSIZE
> would be clearly wrong.  Things aren't that simple (but probably should
> be).  Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks.

Oh, I missed this obvious thing!
Bruce, thank you for putting me back on the ground :-)
Even in the UDF case, when we bread() via a file or directory vnode we pass
a logical 2048-byte block number (within the file) to bread(). In this
case the code in getblk() does the right thing when it calculates offset
as blkno * 2048. Calculating it assuming 512-byte units would be a disaster.
And the actual reading works correctly because udf_strategy is called
for such vnodes and it translates block numbers from logical to physical
and also correctly re-calculates b_iooffset for actual reading. So
b_iooffset value set in breadn() (using dbtob) is completely ignored.
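
A small standalone illustration of the contrast just described (not the
kernel code; the constants follow the UDF case above, where the fs block
and the media sector are both 2048 bytes):

#include <stdio.h>

#define UDF_BSIZE	2048		/* fs block size == media sector size */
#define DEV_BSIZE	512

int
main(void)
{
	long lblkno = 10;		/* logical block within a file */
	long sector = 5000;		/* physical sector on the media */

	/* bread() via the file/directory vnode: blkno is in fs blocks, so
	 * getblk()'s blkno * bo_bsize gives the right byte offset within
	 * the file. */
	printf("file vnode:   b_offset = %ld\n", lblkno * (long)UDF_BSIZE);

	/* udf_strategy then maps lblkno to a physical sector and recomputes
	 * the byte offset for the actual I/O, so the dbtob() value stored
	 * in b_iooffset by breadn() is ignored on this path. */
	printf("device I/O:   b_iooffset = %ld\n", sector * (long)UDF_BSIZE);

	/* bread() via the device vnode: blkno must be in DEV_BSIZE units
	 * (4 * sector here), but getblk() still multiplies by bo_bsize
	 * (2048), producing 4x the real byte offset. */
	printf("device vnode: b_offset = %ld (real byte offset %ld)\n",
	    (4 * sector) * (long)UDF_BSIZE, sector * (long)UDF_BSIZE);
	return (0);
}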

I remember that this is why g_vfs_* was my primary target.
It seems that I am reverting to my earlier opinion: either g_vfs_open should be
smart about setting bo_bsize, or g_vfs_strategy should be smart about
calculating bio_offset, or individual filesystems should not depend on
g_vfs_* always doing the best thing for them.

In any case, it seems that it is the g_vfs_* that is currently
inconsistent: it sets bo_bsize to a somewhat arbitrary value, but
expects that it will always be provided with a correct b_iooffset (the
latter being rigidly calculated via dbtob() in the buffcache code). So this
leaves filesystems dead in the water when reading via a device vnode:
provide blkno in 512-byte units and you screw up the VM cache; provide blkno in
bo_bsize units and you screw up reading from disk.

Not sure if the FS-s should have wits to set bo_bsize properly and/or
provide proper bo_ops - we are talking about a device vnode here.
Should filesystems be involved in the business of setting its
properties? Not sure.
But it seems that there are many possibilities for "fixups" and various
filesystems [have to] do stuff in their unique ways (*_STRATEGY, etc).

-- 
Andriy Gapon


Re: fs/udf: vm pages "overlap" while reading large dir

2008-02-12 Thread Bruce Evans

On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:


In message <[EMAIL PROTECTED]>, Andriy Gapon writes:


2.3. this code passes to bread blkno that is calculated as 4*sector,
where sector is a number of a physical 2048-byte sector. [**]
[**] - I think that this is a requirement of buffcache system, because
internally it performs many calculations that seem to assume that block
size is always 512.


Yes, the buf-cache works in 512-byte units throughout.


I missed the dbtob() conversions in vfs_bio.c when I replied previously.
So blkno cannot be a cookie; it must be for a 512-block.  So how did
the cases where bsize != DEV_BSIZE in getblk() ever work?


3.1. for a fresh buf getblk would assign the following:
bsize = bo->bo_bsize;
offset = blkno * bsize;


That's clearly wrong.


If units were always 512-blocks, then anything except bsize = DEV_BSIZE
would be clearly wrong.  Things aren't that simple (but probably should
be).  Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks.
That seems to include nfs(client).  In fact, nfs_getcacheblk() does
weird scaling which seems to be mainly to compensate for weird scaling
here.  It calls getblk() with a bn arg that seems to be in f_iosize units.
Then at the end, for the VREG case, it sets bp->b_blkno to this bn
scaled to normal DEV_BSIZE units.  bp->b_blkno seems to have DEV_BSIZE
units for all uses of it in nfs.


We need to assert that the blkno is aligned to the start of a sector
and use the 512 byte units, so I guess it would be:

   offset = dbtob(blkno);
   KASSERT(!(offset & (bsize - 1)), ("suitable diagnostic"));


Barely worth checking.

The current bug has nothing to do with this.  The offset is just wrong
at this point, after using a scale factor that is inconsistent with the
units of blkno, for all (?) disk (and other?) file systems whose sector
size isn't 512.

Bruce


Re: fs/udf: vm pages "overlap" while reading large dir

2008-02-12 Thread Bruce Evans

On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:


In message <[EMAIL PROTECTED]>, Andriy Gapon writes:

on 06/02/2008 18:29 Andriy Gapon said the following:

Small summary of the above long description.
For directory reading fs/udf performs bread() on a (underlying) device
vnode. It passes block number as if block size was 512 bytes (i.e.
byte_offset_within_dev/512).


We have three sizes of relevance here, the sectorsize of the provider,
the blocksize of the filesystem and the page size of the system.


4. The block size required for bread() and friends (almost always
   DEV_BSIZE).


In general it is adventurous to have any of them be anything but
powers of two, and it is at best ill-advised and more likely asking
for trouble to do requests that are not a multiple of and aligned to
the sectorsize of the provider.

So up front, I would say that it is a UDF bug to do 512-byte reads off
a 2k sector provider.

Making the buffer cache handle this is possible, but it is not the
direction we have planned for the buffercache (ie: that it should
become a wrapper for using the VM system, rather than being a
separately allocated cache).

So if the objective is to make UDF work in the short term, your
change might work, if the objective is to move FreeBSD's kernel
architecture forward, the right thing to do is to fix UDF to not
access subsectors.


This bug has nothing to do with subsectors, and very little to do with
udf.  udf is just depending on bread()'s API being unbroken.  That API
requires starting with blocks consisting of whole sectors (else the
subsequent I/O would fail) and converting to blocks of size DEV_BSIZE,
except for unusual (non-disk?) file systems like nfs (and no others?).
All (?) disk file systems do this conversion.  The bug is just more
noticeable for udf since it is used more often on disks with a block
size != DEV_BSIZE, and it apparently does something that makes the bug
mess up VM more than other file systems.

Of course, it might be better for the API to take blocks in units of
the sector size, but that is not what it takes, and this can't be
changed easily or safely.

The standard macro btodb() is often used to convert from bytes to
blocks of this size, and doesn't support bo_bsize or the bufobj pointer
that would be needed to reach bo_bsize.

ffs mostly uses its fsbtodb() macro, which converts blocks in ffs block
(frag?)-sized units to blocks in DEV_BSIZE units so as to pass them
to unbroken bread() and friends.  ffs initializes bo_bsize to its block
(not frag) size, and then never uses it directly except in ffs_rawread(),
where it is used to check that the I/O is in units of whole sectors
(otherwise, ffs_rawread() has DEV_BSIZE more hard-coded than the rest
of ffs).

The details of fsbtodb() are interesting.  They show signs of previous
attempts to use units of sectors.  From ffs/fs.h:

% /*
%  * Turn filesystem block numbers into disk block addresses.
%  * This maps filesystem blocks to device size blocks.
%  */
% #define   fsbtodb(fs, b)  ((daddr_t)(b) << (fs)->fs_fsbtodb)
% #define   dbtofsb(fs, b)  ((b) >> (fs)->fs_fsbtodb)

Creation of fs_fsbtodb is left to newfs.  The units of DEV_BSIZE are thus
built in to the on-disk data struct (in a fairly easy to change and/or
override way, but there are a lot of existing disks).  From newfs.c:

% 	realsectorsize = sectorsize;
% 	if (sectorsize != DEV_BSIZE) {		/* XXX */
% 		int secperblk = sectorsize / DEV_BSIZE;
% 
% 		sectorsize = DEV_BSIZE;
% 		fssize *= secperblk;
% 		if (pp != NULL)
% 			pp->p_size *= secperblk;
% 	}
% 	mkfs(pp, special);

Though mkfs() appears to support disk blocks of size equal to the
sectorsize, its sectorsize parameter is hacked on here, so it always
generates fs_fsbtodb and other derived parameters for disk blocks of
size DEV_BSIZE, as is required by the unbroken bread() API.
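
A worked example of that chain, as a sketch: fsbtodb() is reduced here to
its shift, and the shift value of 2 assumes newfs derives fs_fsbtodb as
log2(fs_fsize / DEV_BSIZE) once sectorsize has been forced to DEV_BSIZE
(an assumption based on the snippet above, not a quote of newfs):

#include <assert.h>

#define DEV_BSHIFT	9
#define fsbtodb(fsbtodb_shift, b)	((long)(b) << (fsbtodb_shift))
#define dbtob(db)			((long)(db) << DEV_BSHIFT)

int
main(void)
{
	int fs_fsbtodb = 2;		/* log2(2048 / 512): 2048-byte frags */
	long fsblock = 10;

	/* 10 fs blocks -> 40 DEV_BSIZE blocks -> 20480 bytes. */
	long diskblock = fsbtodb(fs_fsbtodb, fsblock);
	assert(diskblock == 40);
	assert(dbtob(diskblock) == fsblock * 2048);
	return (0);
}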

msdosfs is similar to ffs here, except it constructs the shift count
at mount time, to arrange for always converting to DEV_BSIZE blocks for
passing to bread() and friends.

udf is effectively similar, but pessimal and disorganized.  For the
conversion for bread(), it uses btodb().  In udf_bmap(), it constructs
a shift count on every call by subtracting DEV_BSHIFT from its block shift
count.  In udf_strategy(), on every call it constructs a multiplier
instead of a shift count, by dividing its block size by DEV_BSIZE.  It's
weird that the strategy routine, which will soon end up sending sectors
to the disk, is converting to DEV_BSIZE units, a size that cannot be the
size for udf's normal media.
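
The two conversions are equivalent; a minimal sketch for a 2048-byte UDF
block (the bshift/bsize names follow the udf code discussed in this thread,
the rest is purely illustrative):

#include <assert.h>

#define DEV_BSHIFT	9
#define DEV_BSIZE	512

int
main(void)
{
	int bshift = 11;		/* log2(2048) */
	int bsize = 1 << bshift;	/* 2048 */
	long lsector = 1234;		/* fs-block (sector) number */

	/* udf_bmap() style: shift count built from the block shift. */
	long a = lsector << (bshift - DEV_BSHIFT);

	/* udf_strategy() style: multiplier built from the block size. */
	long b = lsector * (bsize / DEV_BSIZE);

	assert(a == b);			/* both give 4 * lsector */
	return (0);
}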

cd9660 uses btodb() for one conversion for bread() and a shift by
(its block shift count - DEV_BSHIFT) in 7 other conversions to
DEV_BSIZE units.

So it's surprising that any file systems work on CDs and DVDs.  Maybe
the bug affects udf more because udf crosses page boundaries more.

It's also surprising that the bug has such small effects.  This seems
to be because the blkno arg to bread() and fr

Re: fs/udf: vm pages "overlap" while reading large dir

2008-02-12 Thread Andriy Gapon
on 12/02/2008 13:47 Poul-Henning Kamp said the following:
> In message <[EMAIL PROTECTED]>, Andriy Gapon writes:
> 
>> 2.3. this code passes to bread blkno that is calculated as 4*sector,
>> where sector is a number of a physical 2048-byte sector. [**]
>> [**] - I think that this is a requirement of buffcache system, because
>> internally it performs many calculations that seem to assume that block
>> size is always 512.
> 
> Yes, the buf-cache works in 512-byte units throughout.
> 
>> 3.1. for a fresh buf getblk would assign the following:
>> bsize = bo->bo_bsize;
>> offset = blkno * bsize;
> 
> That's clearly wrong.
> 
> We need to assert that the blkno is aligned to the start of a sector
> and use the 512 byte units, so I guess it would be:
> 
> offset = dbtob(blkno);
> KASSERT(!(offset & (bsize - 1)), ("suitable diagnostic"));
> 
> 

Thank you for this very insightful and neat suggestion!
I think that it must work, but I will try to test it tonight on whatever
media and FS-s I have available.
Thank you again!

P.S. hope to not get '"suitable diagnostic"' from something like msdosfs :-)

-- 
Andriy Gapon


Re: fs/udf: vm pages "overlap" while reading large dir

2008-02-12 Thread Poul-Henning Kamp
In message <[EMAIL PROTECTED]>, Andriy Gapon writes:

>2.3. this code passes to bread blkno that is calculated as 4*sector,
>where sector is a number of a physical 2048-byte sector. [**]
>[**] - I think that this is a requirement of buffcache system, because
>internally it performs many calculations that seem to assume that block
>size is always 512.

Yes, the buf-cache works in 512-byte units throughout.

>3.1. for a fresh buf getblk would assign the following:
>bsize = bo->bo_bsize;
>offset = blkno * bsize;

That's clearly wrong.

We need to assert that the blkno is aligned to the start of a sector
and use the 512 byte units, so I guess it would be:

offset = dbtob(blkno);
KASSERT(!(offset & (bsize - 1)), ("suitable diagnostic"));
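
A userland sketch of the suggested check (DEV_BSIZE/DEV_BSHIFT and dbtob()
mirror their sys/param.h definitions; the KASSERT is emulated with assert(3)):

#include <sys/types.h>
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define DEV_BSHIFT	9			/* log2(DEV_BSIZE) */
#define DEV_BSIZE	(1 << DEV_BSHIFT)	/* 512 */
#define dbtob(db)	((off_t)(db) << DEV_BSHIFT)

int
main(void)
{
	long bsize = 2048;		/* bo_bsize on a 2k-sector provider */
	long blkno = 4 * 1000;		/* udf passes 4 * physical sector */
	off_t offset = dbtob(blkno);	/* byte offset on the device */

	/* The blkno must address the start of a provider sector. */
	assert((offset & (bsize - 1)) == 0);
	printf("blkno %ld -> byte offset %jd (sector %jd)\n",
	    blkno, (intmax_t)offset, (intmax_t)(offset / bsize));
	return (0);
}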


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED] | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


Re: fs/udf: vm pages "overlap" while reading large dir

2008-02-12 Thread Andriy Gapon
on 12/02/2008 10:53 Poul-Henning Kamp said the following:
> In message <[EMAIL PROTECTED]>, Andriy Gapon writes:
>> on 06/02/2008 18:29 Andriy Gapon said the following:
>>> Small summary of the above long description.
>>> For directory reading fs/udf performs bread() on a (underlying) device
>>> vnode. It passes block number as if block size was 512 bytes (i.e.
>>> byte_offset_within_dev/512).
> 
> We have three sizes of relevance here, the sectorsize of the provider,
> the blocksize of the filesystem and the page size of the system.
> 
> In general it is adventurous to have any of them be anything but
> powers of two, and it is at best ill-advised and more likely asking
> for trouble to do requests that are not a multiple of and aligned to
> the sectorsize of the provider.
> 
> So up front, I would say that it is a UDF bug to do 512-byte reads off
> a 2k sector provider.
> 
> Making the buffer cache handle this is possible, but it is not the
> direction we have planned for the buffercache (ie: that it should
> become a wrapper for using the VM system, rather than being a
> separately allocated cache).
> 
> So if the objective is to make UDF work in the short term, your
> change might work, if the objective is to move FreeBSD's kernel
> architecture forward, the right thing to do is to fix UDF to not
> access subsectors.
> 

Poul,

I agree with what you say, but I think that I didn't properly explain
what the UDF code is doing and why it might be important in general.
Let me try to do it step by step (sorry if I say something too
obvious). And please correct me if I misunderstand something in the
fundamental steps.

1.1. UDF is typically used with CD/DVD media, so provider's sector size
is 2048.
1.2. udf vfs mount method calls g_vfs_open.
1.3. g_vfs_open creates vobject for a device vnode.
1.4. g_vfs_open sets bo_bsize=pp->sectorsize in the device vnode's bufobj.
1.5. g_vfs_open also overrides bo_ops for the bufobj.

2.1. UDF directory reading code performs bread() via the device vnode. [*]
2.2. this code passes to bread a size that is multiple of 2048.
2.3. this code passes to bread blkno that is calculated as 4*sector,
where sector is a number of a physical 2048-byte sector. [**]

[*]  - this seems to be an uncommon thing among filesystems.
[**] - I think that this is a requirement of buffcache system, because
internally it performs many calculations that seem to assume that block
size is always 512.
E.g. breadn() code has the following:
bp->b_iooffset = dbtob(bp->b_blkno); <-- effectively multiplies by 512
bstrategy(bp);
And g_vfs_strategy has the following:
bip->bio_offset = bp->b_iooffset;
So, if the udf code passed blkno == sector, then bio_offset would be
incorrect.
Or maybe g_vfs_strategy should do some translation here from b_iooffset
to bio_offset, taking into account bo_bsize, so that the actual,
non-adjusted sector number could be passed to bread()?

3.1. for a fresh buf getblk would assign the following:
bsize = bo->bo_bsize;
offset = blkno * bsize;
...
bp->b_blkno = bp->b_lblkno = blkno;
bp->b_offset = offset; <--- this is where this discussion started
so b_offset of a buffer is 4*sector*2048.
This is a source of the trouble.
3.2. getblk would set bp->b_flags |= B_VMIO; if a vnode has a vobject
and our device vnode has it (step 1.3).
3.3. consequently allocbuf will execute code for B_VMIO case.
3.4. allocbuf will lookup/allocate pages by index which is calculated
from "base index" of OFF_TO_IDX(bp->b_offset), which is, in essence,
bp->b_offset divided by page size, 4096. So our "base index" is 2*sector.

4.1. Let's assume we bread() (via the device vnode, as described above)
from physical sector N and we read 6 physical sectors (6*2048 bytes).
4.2 Sectors will get "mapped"/tied/bound to VM pages as follows
(according to the above calculations):
sectors[N, N+1] -> page[2*N],
sectors[N+2, N+3] -> page[2*N + 1], /* the next page */
sectors[N+4, N+5] -> page[2*N + 2]  /* the next page */
4.5. Now let's consider "the next read" of X sectors, but now starting from
sector N+1; repeating the calculations we get the following mapping:
sectors[(N+1), (N+1) + 1] -> page[2*(N+1)] = page[2*N + 2]
But this page already has cached data from sectors[N+4, N+5].
Problem!!

Theoretical calculations show it and practical testing confirms it.
So this is a proof that bread()-ing via a device vnode is broken if:
C1) the vnode was "set up" by g_vfs_open();
C2) the sector size of the underlying geom provider is not 512, but some
non-trivial multiple of it;
C3) the bread size is sufficiently big.

The current UDF code for directory reading is the only place known to me
that meets all three of the above conditions (for sufficiently large
directories, to meet condition C3).

So I have stated this; now let the wise speak.
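
A standalone sketch of the arithmetic in steps 3.1-4.5 above (illustrative
only; the constants follow the walkthrough: 2048-byte sectors, 4096-byte
pages, blkno passed to bread() as 4 * physical sector):

#include <stdio.h>

#define PAGE_SIZE	4096
#define SECTOR_SIZE	2048	/* bo_bsize as set by g_vfs_open() */

static void
map_read(long first_sector, long nsectors)
{
	long blkno = 4 * first_sector;			/* what udf passes */
	long b_offset = blkno * SECTOR_SIZE;		/* getblk: blkno * bo_bsize */
	long base_idx = b_offset / PAGE_SIZE;		/* allocbuf: OFF_TO_IDX(b_offset) */

	for (long s = 0; s < nsectors; s++)
		printf("sector %ld -> page index %ld\n",
		    first_sector + s, base_idx + (s * SECTOR_SIZE) / PAGE_SIZE);
}

int
main(void)
{
	long n = 100;

	map_read(n, 6);		/* sectors N..N+5: pages 2N, 2N+1, 2N+2 */
	map_read(n + 1, 2);	/* sectors N+1..N+2: page 2(N+1) = 2N+2, collision */
	return (0);
}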

I already have a patch that makes UDF read directories via the directory
vnodes. But the problem in general would remain.
Maybe g_vfs_open is a correct place to change, maybe g_vfs_strategy is
the place, maybe something, maybe "don

Re: fs/udf: vm pages "overlap" while reading large dir

2008-02-12 Thread Poul-Henning Kamp
In message <[EMAIL PROTECTED]>, Andriy Gapon writes:
>on 06/02/2008 18:29 Andriy Gapon said the following:
>> Small summary of the above long description.
>> For directory reading fs/udf performs bread() on a (underlying) device
>> vnode. It passes block number as if block size was 512 bytes (i.e.
>> byte_offset_within_dev/512).

We have three sizes of relevance here, the sectorsize of the provider,
the blocksize of the filesystem and the page size of the system.

In general it is adventurous to have any of them be anything but
powers of two, and it is at best ill-advised and more likely asking
for trouble to do requests that are not a multiple of and aligned to
the sectorsize of the provider.

So up front, I would say that it is a UDF bug to do 512-byte reads off
a 2k sector provider.

Making the buffer cache handle this is possible, but it is not the
direction we have planned for the buffercache (ie: that it should
become a wrapper for using the VM system, rather than being a
separately allocated cache).

So if the objective is to make UDF work in the short term, your
change might work, if the objective is to move FreeBSD's kernel
architecture forward, the right thing to do is to fix UDF to not
access subsectors.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED] | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


Re: fs/udf: vm pages "overlap" while reading large dir

2008-02-11 Thread Andriy Gapon
on 06/02/2008 18:29 Andriy Gapon said the following:
> Small summary of the above long description.
> For directory reading fs/udf performs bread() on a (underlying) device
> vnode. It passes block number as if block size was 512 bytes (i.e.
> byte_offset_within_dev/512). On the other hand vm page index calculation
> code uses the following formula: (blkno*bo_bsize)/PAGE_SIZE.
> For CD/DVD devices bo_bsize is set to 2048. This results in page index
> overlap when reading sufficiently many blocks from adjacent starting blocks.
> That is, it seems that in "normal" cases page index gets calculated as
> byte_offset/PAGE_SIZE, but in this case page index gets to be
> 4*byte_offset/PAGE_SIZE. More details are in the quoted above text.
> 
> With a lot of help from Bruce Evans I found this commit which seems to
> have made a big difference:
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.458;r2=1.459;f=h
> 
> Before this change page index for device vnodes was always
> (blkno*512)/PAGE_SIZE. This is because of the vn_isdisk() case.
> 
> udf_mountfs device vnode is passed to g_vfs_open() (in geom_vfs.c) and
> the latter has the following line of code:
> bo->bo_bsize = pp->sectorsize;
> Where pp is geom provider for the device in question.
> 
> I am not sure if the mentioned above vfs_bio.c change broke something
> else in reading from device vnodes or if it fixed something for that. I
> am also not sure what would be the consequences of setting bo_bsize to
> 512 for vnodes of "disk" devices. It would most probably fix current
> code in UDF, but I am not sure if it might break anything else.
> 
> Just wanted to let the more knowledgeable people know and make a decision.
> 
> P.S. bo_bsize seems to be referenced only in a handful of files and in
> most of them it seems that it is related to "file" vnodes (as opposed to
> "device" vnodes).
> 

Poul-Henning,

I am sorry for this duplicate post and very sorry for misspelling your
name in the first one.

do you have any comment or opinion on the above?
[sorry if clamped too much of the previous context, but it's all in
archives]

And a little continuation:
Just for the sake of experiment I tried to emulate the previous
behavior by changing geom_vfs.c, function g_vfs_open(), so that the
relevant code reads as follows:
	if (vn_isdisk(vp, NULL))
		bo->bo_bsize = DEV_BSIZE;
	else
		bo->bo_bsize = pp->sectorsize;

It didn't break anything for me and it re-enabled current (CVS) fs/udf
code to work as before. (patch from kern/78987 still has to be applied
for large directories to be handled).
So udf (on DVD) is fixed, ufs (on HDD), msdosfs (on HDD) and cd9660 (on
CD) are not broken. Of course, the change should have affected only
filesystems on CD/DVD (where sector/block size is not 512 bytes).
Of course, unlike Bruce I don't use msdosfs on CD/DVD media (e.g.
DVD-RAM), so my test is somewhat incomplete.

-- 
Andriy Gapon


Re: fs/udf: vm pages "overlap" while reading large dir

2008-02-11 Thread Andriy Gapon
on 06/02/2008 18:29 Andriy Gapon said the following:
> Small summary of the above long description.
> For directory reading fs/udf performs bread() on a (underlying) device
> vnode. It passes block number as if block size was 512 bytes (i.e.
> byte_offset_within_dev/512). On the other hand vm page index calculation
> code uses the following formula: (blkno*bo_bsize)/PAGE_SIZE.
> For CD/DVD devices bo_bsize is set to 2048. This results in page index
> overlap when reading sufficiently many blocks from adjacent starting blocks.
> That is, it seems that in "normal" cases page index gets calculated as
> byte_offset/PAGE_SIZE, but in this case page index gets to be
> 4*byte_offset/PAGE_SIZE. More details are in the quoted above text.
> 
> With a lot of help from Bruce Evans I found this commit which seems to
> have made a big difference:
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.458;r2=1.459;f=h
> 
> Before this change page index for device vnodes was always
> (blkno*512)/PAGE_SIZE. This is because of the vn_isdisk() case.
> 
> udf_mountfs device vnode is passed to g_vfs_open() (in geom_vfs.c) and
> the latter has the following line of code:
> bo->bo_bsize = pp->sectorsize;
> Where pp is geom provider for the device in question.
> 
> I am not sure if the mentioned above vfs_bio.c change broke something
> else in reading from device vnodes or if it fixed something for that. I
> am also not sure what would be the consequences of setting bo_bsize to
> 512 for vnodes of "disk" devices. It would most probably fix current
> code in UDF, but I am not sure if it might break anything else.
> 
> Just wanted to let the more knowledgeable people know and make a decision.
> 
> P.S. bo_bsize seems to be referenced only in a handful of files and in
> most of them it seems that it is related to "file" vnodes (as opposed to
> "device" vnodes).
> 

Paul,

do you have any comment or opinion on the above?
[sorry if clamped too much of the previous context, but it's all in
archives]

And a little continuation:
Just for the sake of experiment I tried to emulate the previous
behavior by changing geom_vfs.c, function g_vfs_open(), so that the
relevant code reads as follows:
	if (vn_isdisk(vp, NULL))
		bo->bo_bsize = DEV_BSIZE;
	else
		bo->bo_bsize = pp->sectorsize;

It didn't break anything for me and it re-enabled current (CVS) fs/udf
code to work as before. (patch from kern/78987 still has to be applied
for large directories to be handled).
So udf (on DVD) is fixed, ufs (on HDD), msdosfs (on HDD) and cd9660 (on
CD) are not broken. Of course, the change should have affected only
filesystems on CD/DVD (where sector/block size is not 512 bytes).
Of course, unlike Bruce I don't use msdosfs on CD/DVD media (e.g.
DVD-RAM), so my test is somewhat incomplete.

-- 
Andriy Gapon


Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-06 Thread Andriy Gapon
on 06/02/2008 18:34 Andriy Gapon said the following:
> 
> Actually the patch is not entirely correct. max_size returned from
> udf_bmap_internal should be used to calculate the number of contiguous
> sectors for read-ahead (as opposed to the file size used in the patch).
> 

Attached is an updated patch.

The most prominent changes from the previous version:
1. udf_read can handle files with data embedded into fentry
(sysutils/udfclient can produce such things for small files).
2. the above-mentioned files can now be mmap-ed, so that you can not only
cat them, but also cp them; btw, I believe that cp(1) should have logic
to try to fall back to read if mmap() fails - remember kern/92040? :-)
3. udf_bmap uses max_size[*] returned by udf_bmap_internal to calculate
a number of contiguous blocks for read-ahead; [*] - this is a number of
contiguous bytes from a given offset within a file.

Things that stay the same:
1. I still use bread() via the directory vnode; I think it's better even if
reading via the device vnode worked correctly.
2. I still try to read as much data as possible in the directory bread(); I
think this is a sort of ad-hoc read-ahead without all the heuristics
and logic that cluster reading does. I think that this should be ok
because we do not have random reading of directories; it is always
sequential - either to read in the whole directory or to linearly search
through it until the wanted entry is found.

Detailed description of the patch.
udf_vnops.c:
Hunk1 - this is "just in case" patch, reject files with unsupported
strategy right away instead of deferring failure to udf_bmap_internal.
Hunk2 - fix typo in existing macro, add new macro to check if a file has
data embedded into fentry ("unusual file").
Hunk3 - for an unusual file just uiomove data from fentry.
Hunk4 - cosmetic changes plus a fix for udf_bmap_internal errors not
actually being honored; also added an additional printf, because
udf_strategy should never really be called for unusual files.
Hunk5 - return EOPNOTSUPP for unusual files, this is correct because we
can not correctly bmap them and also this enables correct handling of
these files in the vm code (vnode_pager); also code for read-ahead
calculation borrowed from cd9660.
Hunk6 - explain the function of the udf_bmap_internal call in this place.
Hunk7 - some cosmetics; prevent the size passed to bread from being greater
than MAXBSIZE; do bread via the directory vnode rather than the device vnode
(udf_readlblks was a macro-wrapper around bread).

udf_vfsops.c:
Hunk1 - this is borrowed from cd9660, apparently this data is needed for
correct cluster reading.
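
A sketch of the shape hunk 7 describes: one bread() through the directory
vnode per chunk, with the size clamped to MAXBSIZE. The rounding to whole
fs blocks is my assumption, and the names are illustrative, not the actual
diff:

#include <stdio.h>

#define MAXBSIZE	65536
#define UDF_BSIZE	2048

/* How many bytes to ask bread() for at the given offset into a directory. */
static long
dir_read_size(long dir_size, long offset)
{
	long left = dir_size - offset;
	long size = left < MAXBSIZE ? left : MAXBSIZE;

	/* bread() wants whole fs blocks. */
	return ((size + UDF_BSIZE - 1) / UDF_BSIZE) * UDF_BSIZE;
}

int
main(void)
{
	printf("%ld\n", dir_read_size(300000, 0));	/* clamped to 65536 */
	printf("%ld\n", dir_read_size(300000, 294912));	/* 5088 -> 6144 */
	return (0);
}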


Couple of words about testing.
udf in 7.0-RC1 plus this patch correctly handled everything I could
throw at it - huge directories, unusual files, reading, mmaping.
Using udfclientfs I wrote a directory on a DVD-RAM UDF disk that contained
2G of files from ports distfiles. The files varied in size from about
100 bytes to hundreds of megabytes. I watched systat -vmstat while
copying this directory. With unpatched code some files would not be
copied (the small "unusual files"), the KB/t value stayed at 2K (meaning reads
were always sector-sized), and MB/s was about 1. With the patch all files
were copied successfully and correctly (md5 verified), KB/t varied from
2K to 64K (apparently depending on the size of the file being copied), and MB/s
varied from 1 to 4, which is not bad for this DVD-RAM disk and drive.
If somebody asks I can produce a UDF image that would contain various
test cases: large directory, unusual files, etc. Or you can give a try
to udfclient/udfclientfs from sysutils/udfclient port.

I hope this patch will be useful.

-- 
Andriy Gapon
--- udf_vnops.c.orig	2008-01-29 23:50:49.0 +0200
+++ udf_vnops.c	2008-02-07 00:12:34.0 +0200
@@ -168,6 +168,16 @@ udf_open(struct vop_open_args *ap) {
 	struct udf_node *np = VTON(ap->a_vp);
 	off_t fsize;
 
+	/*
+	 * We currently do not support any other strategy type, so
+	 * udf_bmap_internal, udf_bmap, udf_strategy would fail.
+	 * I.e. we can not access content of this file anyway.
+	 */
+	if (le16toh(np->fentry->icbtag.strat_type) != 4) {
+		printf("Unsupported strategy type %d\n", le16toh(np->fentry->icbtag.strat_type));
+		return ENODEV;
+	}
+
 	fsize = le64toh(np->fentry->inf_len);
 	vnode_create_vobject(ap->a_vp, fsize, ap->a_td);
 	return 0;
@@ -340,7 +350,9 @@ udf_pathconf(struct vop_pathconf_args *a
 
 #define lblkno(udfmp, loc)	((loc) >> (udfmp)->bshift)
 #define blkoff(udfmp, loc)	((loc) & (udfmp)->bmask)
-#define lblktosize(imp, blk)	((blk) << (udfmp)->bshift)
+#define lblktosize(udfmp, blk)	((blk) << (udfmp)->bshift)
+
+#define HAS_EMBEDDED_DATA(node) ((le16toh((node)->fentry->icbtag.flags) & 0x7) == 3)
 
 static int
 udf_read(struct vop_read_args *ap)
@@ -359,6 +371,26 @@ udf_read(struct vop_read_args *ap)
 		return (0);
 	if (uio->uio_offset < 0)
 		return (EINVAL);
+
+	if (HAS_EMBEDDED_DATA(node)) {
+		struct file_entry *fentry;
+		uint8_t *data;
+
+		fentry = node->fentry;
+		data = &fentry->data[le32toh(fentr

Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-06 Thread Andriy Gapon
on 05/02/2008 22:43 Scott Long said the following:
> Andriy Gapon wrote:
> 
>> But there is another issue that I also mentioned in the email about
>> directory reading. It is UDF_INVALID_BMAP case of udf_bmap_internal,
>> i.e. the case when file data is embedded into a file entry.
>> This is a special case that needs to be handled differently.
>> udf_readatoffset() handles it, but the latest udf_read code doesn't.
>> I have a real UDF filesystem where this type of allocation is used for
>> small files and those files can not be read.
> 
> Oh, so directory data can also follow this convention?  Blah.  Feel free
> to fix that too if you want =-)

Yes, this allocation type can be applied to both files and directories.
Directories are already handled because of udf_readatoffset(), but udf_read
needs to be made aware of this special case.
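
A simplified userland sketch of what reading such an "embedded data" file
amounts to: the bytes live inside the file entry itself, so a read is a
bounded copy from that in-memory structure. The struct below is a
hypothetical stand-in, not the udf driver's file_entry layout, and the real
code would uiomove() instead of memcpy():

#include <stdio.h>
#include <string.h>

struct fake_fentry {
	long	inf_len;		/* information length (file size) */
	char	data[128];		/* embedded file data */
};

static long
embedded_read(const struct fake_fentry *fe, long off, char *buf, long len)
{
	if (off >= fe->inf_len)
		return (0);
	if (len > fe->inf_len - off)
		len = fe->inf_len - off;
	memcpy(buf, fe->data + off, len);	/* driver: uiomove() */
	return (len);
}

int
main(void)
{
	struct fake_fentry fe = { .inf_len = 5, .data = "hello" };
	char buf[16] = "";

	long n = embedded_read(&fe, 0, buf, sizeof(buf));
	printf("read %ld bytes: %.*s\n", n, (int)n, buf);
	return (0);
}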

-- 
Andriy Gapon


Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-06 Thread Andriy Gapon
on 05/02/2008 20:16 Pav Lucistnik said the following:
> Andriy Gapon wrote on Tue, 05. 02. 2008 at 16:40 +0200:
> 
>>> Yay, and can you fix the sequential read performance while you're at it?
>>> Kthx!
> 
>> this was almost trivial :-)
>> See the attached patch, first hunk is just for consistency.
>> The code was borrowed from cd9660, only field/variable names are adjusted.
> 
> Omg looks good. Can't wait for it to bubble throught to 6-STABLE :)
> 
> Thanks million for working on this.
> 

Actually the patch is not entirely correct. max_size returned from
udf_bmap_internal should be used to calculate the number of contiguous
sectors for read-ahead (as opposed to the file size used in the patch).

-- 
Andriy Gapon


Re: fs/udf: vm pages "overlap" while reading large dir

2008-02-06 Thread Andriy Gapon
on 04/02/2008 17:55 Scott Long said the following:
> Andriy Gapon wrote:
[snip]
>> After some code reading and run-time debugging, here are some facts
>> about udf directory reading:
>> 1. bread-ing is done via device vnode (as opposed to directory vnodes),
>> as a consequence udf_strategy is not involved in directory reading
>> 2. the device vnode has a non-NULL v_object, this means that VMIO is used
[strike out incorrect guess]
>> 4. bread is done on as much data as possible, up to MAXBSIZE [and even
>> slightly more in some cases] (i.e. code determines directory data size
>> and attempts to read in everything in one go)
>> 5. physical sector number adjusted to DEV_BSIZE (512 bytes) sectors is
>> passed to bread - this is because of #1 above and peculiarities of buf
>> system
>> 6. directory reading and file reading are quite different, because the
>> latter does reading via file vnode
>>
>> Let's a consider a simple case of "directory reading" 5 * PAGE_SIZE (4K)
>> bytes starting from a physical sector N with N%2 = 0 from media with
>> sector size of 2K (most common CD/DVD case):
>> 1. we want to read 12 KB = 3 pages = 6 sectors starting from sector N,
>> N%2 = 0
>> 2. 4*N is passed as a sector number to bread
>> 3. bo_bsize of the corresponding bufobj is a media sector size, i.e. 2K
>> 4. actual read in bread will happen from b_iooffset of 4*N*DEV_BSIZE =
>> N*2K, i.e. correct offset within the device
>> 5. for a fresh read getblk will be called
>> 6. getblk will set b_offset to blkno*bo_bsize=4*N*2K, i.e. 4 times the
>> correct byte offset within the device
>> 7. allocbuf will allocate pages using B_VMIO case
>> 8. allocbuf calculates base page index as follows: pi = b_offset/PAGE_SIZE
>> this means the following:
>> sectors N and N+1 will be bound to a page with index 4*N*2K/4K = 2*N
>> sectors N+2 and N+3 will be bound to the next page 2*N +1
>> sectors N+4 and N+5 is tied to the next page 2*N + 2
[insert] note that bstrategy will get correct params as b_iooffset is
set to blkno*DEV_BSIZE, i.e. correct byte offset.
>>
>> Now let's consider a "directory read" of 2 sectors (1 page) starting at
>> physical sector N+1.
>> Using the same calculations as above, we see that the sector will be
>> tied to a page 2*(N+1) = 2*N + 2.
>> And this page already contains valid cached data from the previous read,
>> *but* it contains data from sectors N+4 and N+5 instead of N+1 and N+2.
>>
>> So, we see, that because of b_offset being 4 times the actual byte
>> offset, we get incorrect page indexing. Namely, sector X gets associated
>> with different pages depending on what sector is used as a starting
>> sector for bread. If bread starts at sector N, then data of sector N+1
>> is associated with (second half of) page with index 2*N; but if bread
>> starts at sector N+1 then it gets associated with (the first half of)
>> page with index 2*(N+1).
[snip]
> I think you forgot to attach the patch =-)  This is an excellent 
> write-up, and I think you're on the right track.  It's been nearly 8 
> years since I wrote most of this code, so my memory is a bit fuzzy, but
> I think the reason that I used the device vnode is because I was having
> trouble with overlapping buffers when using the directory vnode, so this
> was the easiest way to avoid that at the time.  Since it sounds like 
> this is no longer a viable solution, I definitely support your ideas.

[additional analysis of the problem, with new facts that might give rise
to a different solution]

Small summary of the above long description.
For directory reading fs/udf performs bread() on a (underlying) device
vnode. It passes block number as if block size was 512 bytes (i.e.
byte_offset_within_dev/512). On the other hand vm page index calculation
code uses the following formula: (blkno*bo_bsize)/PAGE_SIZE.
For CD/DVD devices bo_bsize is set to 2048. This results in page index
overlap when reading sufficiently many blocks from adjacent starting blocks.
That is, it seems that in "normal" cases page index gets calculated as
byte_offset/PAGE_SIZE, but in this case page index gets to be
4*byte_offset/PAGE_SIZE. More details are in the quoted above text.

With a lot of help from Bruce Evans I found this commit which seems to
have made a big difference:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.458;r2=1.459;f=h

Before this change page index for device vnodes was always
(blkno*512)/PAGE_SIZE. This is because of the vn_isdisk() case.

udf_mountfs device vnode is passed to g_vfs_open() (in geom_vfs.c) and
the latter has the following line of code:
bo->bo_bsize = pp->sectorsize;
Where pp is geom provider for the device in question.

I am not sure if the mentioned above vfs_bio.c change broke something
else in reading from device vnodes or if it fixed something for that. I
am also not sure what would be the consequences of setting bo_bsize to
512 for vnodes of "disk" devices. It would most probably fix current
code in UDF, but I am not sure if i

Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-05 Thread Julian Elischer

Andriy Gapon wrote:

on 04/02/2008 22:07 Pav Lucistnik said the following:

Julian Elischer wrote on Mon, 04. 02. 2008 at 10:36 -0800:

Andriy Gapon wrote:

More on the problem with reading big directories on UDF.

You do realise that you have now made yourself the official
maintainer of the UDF file system by submitting a competent
and insightful analysis of the problem?

Yay, and can you fix the sequential read performance while you're at it?
Kthx!



Pav,

this was almost trivial :-)


Ok that settles it.. it's yours.


See the attached patch, first hunk is just for consistency.
The code was borrowed from cd9660, only field/variable names are adjusted.

But there is another issue that I also mentioned in the email about
directory reading. It is UDF_INVALID_BMAP case of udf_bmap_internal,
i.e. the case when file data is embedded into a file entry.
This is a special case that needs to be handled differently.
udf_readatoffset() handles it, but the latest udf_read code doesn't.
I have a real UDF filesystem where this type of allocation is used for
small files and those files can not be read.

This is described in Part 4, section 14.6.8 of ECMA-167.






Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-05 Thread Scott Long

Andriy Gapon wrote:

on 04/02/2008 22:07 Pav Lucistnik said the following:

Julian Elischer wrote on Mon, 04. 02. 2008 at 10:36 -0800:

Andriy Gapon wrote:

More on the problem with reading big directories on UDF.

You do realise that you have now made yourself the official
maintainer of the UDF file system by submitting a competent
and insightful analysis of the problem?

Yay, and can you fix the sequential read performance while you're at it?
Kthx!



Pav,

this was almost trivial :-)
See the attached patch, first hunk is just for consistency.
The code was borrowed from cd9660, only field/variable names are adjusted.



Your patch looks reasonable.  Btw, for the same reason that read-ahead
makes file reading much faster, I would not change directory reading to
be 1 sector at a time (unless you also do read-ahead for it).


But there is another issue that I also mentioned in the email about
directory reading. It is UDF_INVALID_BMAP case of udf_bmap_internal,
i.e. the case when file data is embedded into a file entry.
This is a special case that needs to be handled differently.
udf_readatoffset() handles it, but the latest udf_read code doesn't.
I have a real UDF filesystem where this type of allocation is used for
small files and those files can not be read.


Oh, so directory data can also follow this convention?  Blah.  Feel free
to fix that too if you want =-)

Scott



Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-05 Thread Pav Lucistnik
Andriy Gapon wrote on Tue, 05. 02. 2008 at 16:40 +0200:

> > Yay, and can you fix the sequential read performance while you're at it?
> > Kthx!

> this was almost trivial :-)
> See the attached patch, first hunk is just for consistency.
> The code was borrowed from cd9660, only field/variable names are adjusted.

Omg, looks good. Can't wait for it to bubble through to 6-STABLE :)

Thanks a million for working on this.

-- 
Pav Lucistnik <[EMAIL PROTECTED]>
  <[EMAIL PROTECTED]>
What is the airspeed velocity of an unladen swallow?




Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-05 Thread Andriy Gapon
on 04/02/2008 20:36 Julian Elischer said the following:
> Andriy Gapon wrote:
>> More on the problem with reading big directories on UDF.
> 
> You do realise that you have now made yourself the official
> maintainer of the UDF file system by submitting a competent
> and insightful analysis of the problem?

I feel like I have to reply to this :-)
First of all, thank you. But unfortunately my knowledge of the core kernel
is quite limited; I pick up some bits here and there by reading the
code, but this is too little and sometimes my deductions are wrong.
Another thing is very limited time.

I do like fixing code, but I rarely have time, and when I do, it must be
something very interesting and important to me.

This time it is interesting :-) Because I actually use UDF quite a lot,
I use DVD-RAM media for various archival and data transfer purposes, and
sometimes there is some unusual case. I also try to keep up with FreeBSD
development as much as possible, so sometimes there are new issues after
upgrades.

So, thank you again, but please think of me as an occasional patch
submitter.

-- 
Andriy Gapon


Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-05 Thread Andriy Gapon
on 04/02/2008 22:07 Pav Lucistnik said the following:
> Julian Elischer wrote on Mon, 04. 02. 2008 at 10:36 -0800:
>> Andriy Gapon wrote:
>>> More on the problem with reading big directories on UDF.
>> You do realise that you have now made yourself the official
>> maintainer of the UDF file system by submitting a competent
>> and insightful analysis of the problem?
> 
> Yay, and can you fix the sequential read performance while you're at it?
> Kthx!
> 

Pav,

this was almost trivial :-)
See the attached patch; the first hunk is just for consistency.
The code was borrowed from cd9660, only field/variable names are adjusted.

But there is another issue that I also mentioned in the email about
directory reading. It is the UDF_INVALID_BMAP case of udf_bmap_internal,
i.e. the case where the file data is embedded into the file entry.
This is a special case that needs to be handled differently.
udf_readatoffset() handles it, but the latest udf_read code doesn't.
I have a real UDF filesystem where this type of allocation is used for
small files, and those files cannot be read.

This is described in Part 4, section 14.6.8 of ECMA-167.
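
For illustration, a rough sketch of how the embedded-data case might be
served in udf_read (field and variable names follow the driver loosely and
are assumptions here; this is not a tested patch):

struct file_entry *fe = node->fentry;
int error;

if (udf_bmap_internal(node, offset, &lsector, &max_size) ==
    UDF_INVALID_BMAP) {
	/*
	 * The data lives inside the file entry itself, right after the
	 * extended attributes, so there is nothing to bread() - just
	 * copy it out of the in-core file entry.
	 */
	uint8_t *data = &fe->data[le32toh(fe->l_ea)];
	uint64_t fsize = le64toh(fe->inf_len);
	size_t n;

	if (uio->uio_offset >= (off_t)fsize)
		return (0);
	n = MIN((uint64_t)uio->uio_resid, fsize - uio->uio_offset);
	error = uiomove(data + uio->uio_offset, n, uio);
	return (error);
}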

-- 
Andriy Gapon
--- udf_vnops.c.orig	2008-01-29 23:50:49.0 +0200
+++ udf_vnops.c	2008-02-05 01:30:23.0 +0200
@@ -851,7 +846,7 @@ udf_bmap(struct vop_bmap_args *a)
 	if (a->a_runb)
 		*a->a_runb = 0;
 
-	error = udf_bmap_internal(node, a->a_bn * node->udfmp->bsize, &lsector,
+	error = udf_bmap_internal(node, a->a_bn << node->udfmp->bshift, &lsector,
 	&max_size);
 	if (error)
 		return (error);
@@ -859,9 +854,27 @@ udf_bmap(struct vop_bmap_args *a)
 	/* Translate logical to physical sector number */
 	*a->a_bnp = lsector << (node->udfmp->bshift - DEV_BSHIFT);
 
-	/* Punt on read-ahead for now */
-	if (a->a_runp)
-		*a->a_runp = 0;
+	/*
+	 * Determine maximum number of readahead blocks following the
+	 * requested block.
+	 */
+	if (a->a_runp) {
+		off_t fsize;
+		int nblk;
+
+		fsize = le64toh(node->fentry->inf_len);
+		nblk = (fsize >> node->udfmp->bshift) - (a->a_bn + 1);
+		if (nblk <= 0)
+			*a->a_runp = 0;
+		else if (nblk >= (MAXBSIZE >> node->udfmp->bshift))
+			*a->a_runp = (MAXBSIZE >> node->udfmp->bshift) - 1;
+		else
+			*a->a_runp = nblk;
+	}
+
+	if (a->a_runb) {
+		*a->a_runb = 0;
+	}
 
 	return (0);
 }
--- udf_vfsops.c.orig	2007-03-13 03:50:24.0 +0200
+++ udf_vfsops.c	2008-02-05 01:29:10.0 +0200
@@ -330,6 +330,11 @@ udf_mountfs(struct vnode *devvp, struct 
 
 	bo = &devvp->v_bufobj;
 
+	if (devvp->v_rdev->si_iosize_max != 0)
+		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
+	if (mp->mnt_iosize_max > MAXPHYS)
+		mp->mnt_iosize_max = MAXPHYS;
+
 	/* XXX: should be M_WAITOK */
 	MALLOC(udfmp, struct udf_mnt *, sizeof(struct udf_mnt), M_UDFMOUNT,
 	M_NOWAIT | M_ZERO);
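
As a quick, stand-alone sanity check of the read-ahead arithmetic in the
udf_bmap hunk above, the following userland program mirrors the a_runp
computation; the 2 KB sector size (bshift = 11) and the 64 KB MAXBSIZE are
assumptions picked for the example:

#include <stdio.h>

#define BSHIFT		11
#define MAXBSIZE	65536

static int
runp(long long fsize, long long bn)
{
	long long nblk = (fsize >> BSHIFT) - (bn + 1);

	if (nblk <= 0)
		return (0);
	if (nblk >= (MAXBSIZE >> BSHIFT))
		return ((MAXBSIZE >> BSHIFT) - 1);
	return ((int)nblk);
}

int
main(void)
{
	/* 100 KB file = 50 sectors of 2 KB. */
	printf("bn=10 -> runp=%d\n", runp(100 * 1024, 10));	/* 31: clamped */
	printf("bn=47 -> runp=%d\n", runp(100 * 1024, 47));	/* 2: near EOF */
	printf("bn=49 -> runp=%d\n", runp(100 * 1024, 49));	/* 0: last block */
	return (0);
}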
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[EMAIL PROTECTED]"

Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-04 Thread Pav Lucistnik
Julian Elischer wrote on Mon, 04. 02. 2008 at 10:36 -0800:
> Andriy Gapon wrote:
> > More on the problem with reading big directories on UDF.
> 
> You do realise that you have now made yourself the official
> maintainer of the UDF file system by submitting a competent
> and insightful analysis of the problem?

Yay, and can you fix the sequential read performance while you're at it?
Kthx!

-- 
Pav Lucistnik <[EMAIL PROTECTED]>
  <[EMAIL PROTECTED]>
Squish. Larger than the normal icky things, and twice as icky.

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-04 Thread Julian Elischer

Andriy Gapon wrote:

More on the problem with reading big directories on UDF.


You do realise that you have now made yourself the official
maintainer of the UDF file system by submitting a competent
and insightful analysis of the problem?



First, some sleuthing. I came to believe that the problem is caused by
some larger change in vfs/vm/buf area. It seems that now VMIO is applied
to more vnode types than before. In particular it seems that now vnodes
for devices have non-NULL v_object (or maybe this is about directory
vnodes, I am not sure).

Also it seems that the problem should happen for any directory with size
larger than four 2048-byte sectors (I think that any directory with >
300 files would definitely qualify).

After some code reading and run-time debugging, here are some facts
about udf directory reading:
1. bread-ing is done via device vnode (as opposed to directory vnodes),
as a consequence udf_strategy is not involved in directory reading
2. the device vnode has a non-NULL v_object, this means that VMIO is used
3. it seems that the code assumed that non-VM buffers are used
4. bread is done on as much data as possible, up to MAXBSIZE [and even
slightly more in some cases] (i.e. code determines directory data size
and attempts to read in everything in one go)
5. the physical sector number, converted to DEV_BSIZE (512-byte) sectors, is
passed to bread - this is because of #1 above and peculiarities of the buf
system (a rough sketch of this call follows the list)
6. directory reading and file reading are quite different, because the
latter does reading via file vnode
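
For concreteness, a rough sketch of the call that #1 and #5 describe
(field names such as im_devvp are illustrative, not a verbatim quote of
the driver):

/*
 * Directory data is bread() through the *device* vnode, with the 2 KB
 * sector number rescaled to 512-byte DEV_BSIZE blocks, so udf_strategy()
 * never gets a chance to translate anything.
 */
struct buf *bp;
int error;

error = bread(udfmp->im_devvp,
    lsector << (udfmp->bshift - DEV_BSHIFT),	/* 4 * media sector number */
    max_size, NOCRED, &bp);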

Let's consider a simple case of "directory reading" 3 * PAGE_SIZE (4K)
bytes starting from a physical sector N with N%2 = 0, from media with a
sector size of 2K (the most common CD/DVD case):
1. we want to read 12 KB = 3 pages = 6 sectors starting from sector N,
N%2 = 0
2. 4*N is passed as a sector number to bread
3. bo_bsize of the corresponding bufobj is a media sector size, i.e. 2K
4. actual read in bread will happen from b_iooffset of 4*N*DEV_BSIZE =
N*2K, i.e. correct offset within the device
5. for a fresh read getblk will be called
6. getblk will set b_offset to blkno*bo_bsize=4*N*2K, i.e. 4 times the
correct byte offset within the device
7. allocbuf will allocate pages using B_VMIO case
8. allocbuf calculates base page index as follows: pi = b_offset/PAGE_SIZE
this means the following:
sectors N and N+1 will be bound to a page with index 4*N*2K/4K = 2*N
sectors N+2 and N+3 will be bound to the next page 2*N + 1
sectors N+4 and N+5 will be bound to the next page 2*N + 2

Now let's consider a "directory read" of 2 sectors (1 page) starting at
physical sector N+1.
Using the same calculations as above, we see that the sector will be
tied to a page 2*(N+1) = 2*N + 2.
And this page already contains valid cached data from the previous read,
*but* it contains data from sectors N+4 and N+5 instead of N+1 and N+2.

So we see that, because of b_offset being 4 times the actual byte
offset, we get incorrect page indexing. Namely, sector X gets associated
with different pages depending on what sector is used as a starting
sector for bread. If bread starts at sector N, then data of sector N+1
is associated with (second half of) page with index 2*N; but if bread
starts at sector N+1 then it gets associated with (the first half of)
page with index 2*(N+1).
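
The mismatch is easy to reproduce with plain arithmetic. The following
stand-alone program mimics the blkno/b_offset/page-index computation
described above; the 2 KB sector size and 4 KB page size are the example
values assumed here:

#include <stdio.h>

#define DEV_BSHIFT	9	/* 512-byte device blocks */
#define SECT_SIZE	2048L	/* media sector size */
#define PAGE_SZ		4096L	/* VM page size */

/*
 * Page index the buffer code derives for the nth sector of a buffer
 * whose bread() started at physical sector 'start'.
 */
static long
page_of(long start, long nth)
{
	long blkno = start * (SECT_SIZE >> DEV_BSHIFT);	/* 4*N passed to bread() */
	long b_offset = blkno * SECT_SIZE;		/* getblk(): blkno * bo_bsize */

	return ((b_offset + nth * SECT_SIZE) / PAGE_SZ);
}

int
main(void)
{
	long n = 100;

	/* Sector N+1 read as part of a buffer starting at N: page 2*N. */
	printf("N+1 via bread(N):   page %ld\n", page_of(n, 1));
	/* The same sector read as the start of its own buffer: page 2*N+2. */
	printf("N+1 via bread(N+1): page %ld\n", page_of(n + 1, 0));
	return (0);
}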

Previously, before VMIO change, data for all reads was put into separate
buffers that did not share anything between them. So the problem was
limited only to wasting memory with duplicate data, but no actual
over-runs did happen. Now we have the over-runs because the VM pages are
shared between the buffers of the same vnode.

One obvious solution is to limit bread size to 2*PAGE_SIZE = 4 *
sector_size. In this case, as before, we would waste some memory on
duplicate data but we would avoid page overruns. If we limit bread size
even more, to 1 sector, then we would not have any duplicate data at
all. But there would still be some resource waste - each page would
correspond to one sector, so 4K page would have only 2K of valid data
and the other half in each page is unused.

Another solution, which to me seems to be better, is to do "usual"
reading - pass a directory vnode and 2048-byte sector offset to bread.
In this case udf_strategy will get called for blkno translation, all
offsets and indexes will be correct, and everything will work perfectly,
as is the case for all other filesystems.
The only overhead in this case comes from the need to handle the
UDF_INVALID_BMAP case (where data is embedded into the file entry). So it
means that we have to call bmap_internal to see if we have that special
case and handle it, and if the case is not special we would call bread
on a directory vnode, which means that bmap_internal would be called
again in udf_strategy.
One optimization that can be done in this case is to create a lighter
version of bmap_internal that would merely check for the special case
and wouldn't do anything else.
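
A minimal sketch of that preferred path, with illustrative variable names
rather than the actual patch, could look like this:

/*
 * Read one directory sector through the directory vnode; bread() gets a
 * logical 2048-byte sector number and udf_strategy()/udf_bmap_internal()
 * translate it, so b_offset and the page indexes come out right.
 */
struct buf *bp;
int error;

error = bread(vp, lbn, udfmp->bsize, NOCRED, &bp);
if (error) {
	brelse(bp);
	return (error);
}
/* bp->b_data now holds one correctly indexed directory sector. */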

I am attaching a patch based on the

Re: fs/udf: vm pages "overlap" while reading large dir

2008-02-04 Thread Scott Long

Andriy Gapon wrote:

More on the problem with reading big directories on UDF.

First, some sleuthing. I came to believe that the problem is caused by
some larger change in vfs/vm/buf area. It seems that now VMIO is applied
to more vnode types than before. In particular it seems that now vnodes
for devices have non-NULL v_object (or maybe this is about directory
vnodes, I am not sure).

Also it seems that the problem should happen for any directory with size
larger than four 2048-byte sectors (I think that any directory with >
300 files would definitely qualify).

After some code reading and run-time debugging, here are some facts
about udf directory reading:
1. bread-ing is done via device vnode (as opposed to directory vnodes),
as a consequence udf_strategy is not involved in directory reading
2. the device vnode has a non-NULL v_object, this means that VMIO is used
3. it seems that the code assumed that non-VM buffers are used
4. bread is done on as much data as possible, up to MAXBSIZE [and even
slightly more in some cases] (i.e. code determines directory data size
and attempts to read in everything in one go)
5. the physical sector number, converted to DEV_BSIZE (512-byte) sectors, is
passed to bread - this is because of #1 above and peculiarities of the buf
system
6. directory reading and file reading are quite different, because the
latter does reading via file vnode

Let's consider a simple case of "directory reading" 3 * PAGE_SIZE (4K)
bytes starting from a physical sector N with N%2 = 0, from media with a
sector size of 2K (the most common CD/DVD case):
1. we want to read 12 KB = 3 pages = 6 sectors starting from sector N,
N%2 = 0
2. 4*N is passed as a sector number to bread
3. bo_bsize of the corresponding bufobj is a media sector size, i.e. 2K
4. actual read in bread will happen from b_iooffset of 4*N*DEV_BSIZE =
N*2K, i.e. correct offset within the device
5. for a fresh read getblk will be called
6. getblk will set b_offset to blkno*bo_bsize=4*N*2K, i.e. 4 times the
correct byte offset within the device
7. allocbuf will allocate pages using B_VMIO case
8. allocbuf calculates base page index as follows: pi = b_offset/PAGE_SIZE
this means the following:
sectors N and N+1 will be bound to a page with index 4*N*2K/4K = 2*N
sectors N+2 and N+3 will be bound to the next page 2*N + 1
sectors N+4 and N+5 will be bound to the next page 2*N + 2

Now let's consider a "directory read" of 2 sectors (1 page) starting at
physical sector N+1.
Using the same calculations as above, we see that the sector will be
tied to a page 2*(N+1) = 2*N + 2.
And this page already contains valid cached data from the previous read,
*but* it contains data from sectors N+4 and N+5 instead of N+1 and N+2.

So we see that, because of b_offset being 4 times the actual byte
offset, we get incorrect page indexing. Namely, sector X gets associated
with different pages depending on what sector is used as a starting
sector for bread. If bread starts at sector N, then data of sector N+1
is associated with (second half of) page with index 2*N; but if bread
starts at sector N+1 then it gets associated with (the first half of)
page with index 2*(N+1).

Previously, before VMIO change, data for all reads was put into separate
buffers that did not share anything between them. So the problem was
limited only to wasting memory with duplicate data, but no actual
over-runs did happen. Now we have the over-runs because the VM pages are
shared between the buffers of the same vnode.

One obvious solution is to limit bread size to 2*PAGE_SIZE = 4 *
sector_size. In this case, as before, we would waste some memory on
duplicate data but we would avoid page overruns. If we limit bread size
even more, to 1 sector, then we would not have any duplicate data at
all. But there would still be some resource waste - each page would
correspond to one sector, so 4K page would have only 2K of valid data
and the other half in each page is unused.

Another solution, which to me seems to be better, is to do "usual"
reading - pass a directory vnode and 2048-byte sector offset to bread.
In this case udf_strategy will get called for blkno translation, all
offsets and indexes will be correct, and everything will work perfectly,
as is the case for all other filesystems.
The only overhead in this case comes from the need to handle the
UDF_INVALID_BMAP case (where data is embedded into the file entry). So it
means that we have to call bmap_internal to see if we have that special
case and handle it, and if the case is not special we would call bread
on a directory vnode, which means that bmap_internal would be called
again in udf_strategy.
One optimization that can be done in this case is to create a lighter
version of bmap_internal that would merely check for the special case
and wouldn't do anything else.

I am attaching a patch based on the ideas above. It fixes the problem
for me and doesn't seem to create any new ones :-)
About the patch:
hunk #1 fixes a copy/paste
hunk #2 should fix strategy to n

fs/udf: vm pages "overlap" while reading large dir [patch]

2008-02-04 Thread Andriy Gapon

More on the problem with reading big directories on UDF.

First, some sleuthing. I came to believe that the problem is caused by
some larger change in vfs/vm/buf area. It seems that now VMIO is applied
to more vnode types than before. In particular it seems that now vnodes
for devices have non-NULL v_object (or maybe this is about directory
vnodes, I am not sure).

Also it seems that the problem should happen for any directory with size
larger than four 2048-byte sectors (I think that any directory with >
300 files would definitely qualify).

After some code reading and run-time debugging, here are some facts
about udf directory reading:
1. bread-ing is done via device vnode (as opposed to directory vnodes),
as a consequence udf_strategy is not involved in directory reading
2. the device vnode has a non-NULL v_object, this means that VMIO is used
3. it seems that the code assumed that non-VM buffers are used
4. bread is done on as much data as possible, up to MAXBSIZE [and even
slightly more in some cases] (i.e. code determines directory data size
and attempts to read in everything in one go)
5. the physical sector number, converted to DEV_BSIZE (512-byte) sectors, is
passed to bread - this is because of #1 above and peculiarities of the buf
system
6. directory reading and file reading are quite different, because the
latter does reading via file vnode

Let's consider a simple case of "directory reading" 3 * PAGE_SIZE (4K)
bytes starting from a physical sector N with N%2 = 0, from media with a
sector size of 2K (the most common CD/DVD case):
1. we want to read 12 KB = 3 pages = 6 sectors starting from sector N,
N%2 = 0
2. 4*N is passed as a sector number to bread
3. bo_bsize of the corresponding bufobj is a media sector size, i.e. 2K
4. actual read in bread will happen from b_iooffset of 4*N*DEV_BSIZE =
N*2K, i.e. correct offset within the device
5. for a fresh read getblk will be called
6. getblk will set b_offset to blkno*bo_bsize=4*N*2K, i.e. 4 times the
correct byte offset within the device
7. allocbuf will allocate pages using B_VMIO case
8. allocbuf calculates base page index as follows: pi = b_offset/PAGE_SIZE
this means the following:
sectors N and N+1 will be bound to a page with index 4*N*2K/4K = 2*N
sectors N+2 and N+3 will be bound to the next page 2*N + 1
sectors N+4 and N+5 will be bound to the next page 2*N + 2

Now let's consider a "directory read" of 2 sectors (1 page) starting at
physical sector N+1.
Using the same calculations as above, we see that the sector will be
tied to a page 2*(N+1) = 2*N + 2.
And this page already contains valid cached data from the previous read,
*but* it contains data from sectors N+4 and N+5 instead of N+1 and N+2.

So we see that, because of b_offset being 4 times the actual byte
offset, we get incorrect page indexing. Namely, sector X gets associated
with different pages depending on what sector is used as a starting
sector for bread. If bread starts at sector N, then data of sector N+1
is associated with (second half of) page with index 2*N; but if bread
starts at sector N+1 then it gets associated with (the first half of)
page with index 2*(N+1).

Previously, before VMIO change, data for all reads was put into separate
buffers that did not share anything between them. So the problem was
limited only to wasting memory with duplicate data, but no actual
over-runs did happen. Now we have the over-runs because the VM pages are
shared between the buffers of the same vnode.

One obvious solution is to limit bread size to 2*PAGE_SIZE = 4 *
sector_size. In this case, as before, we would waste some memory on
duplicate data but we would avoid page overruns. If we limit bread size
even more, to 1 sector, then we would not have any duplicate data at
all. But there would still be some resource waste - each page would
correspond to one sector, so 4K page would have only 2K of valid data
and the other half in each page is unused.

Another solution, which to me seems to be better, is to do "usual"
reading - pass a directory vnode and 2048-byte sector offset to bread.
In this case udf_strategy will get called for blkno translation, all
offsets and indexes will be correct, and everything will work perfectly,
as is the case for all other filesystems.
The only overhead in this case comes from the need to handle the
UDF_INVALID_BMAP case (where data is embedded into the file entry). So it
means that we have to call bmap_internal to see if we have that special
case and handle it, and if the case is not special we would call bread
on a directory vnode, which means that bmap_internal would be called
again in udf_strategy.
One optimization that can be done in this case is to create a lighter
version of bmap_internal that would merely check for the special case
and wouldn't do anything else.

I am attaching a patch based on the ideas above. It fixes the problem
for me and doesn't seem to create any new ones :-)
About the patch:
hunk #1 fixes a copy/paste
hunk #2 should fix strategy to not set junk b_blkno i

fs/udf: vm pages "overlap" while reading large dir

2008-02-04 Thread Andriy Gapon

More on the problem with reading big directories on UDF.

First, some sleuthing. I came to believe that the problem is caused by
some larger change in vfs/vm/buf area. It seems that now VMIO is applied
to more vnode types than before. In particular it seems that now vnodes
for devices have non-NULL v_object (or maybe this is about directory
vnodes, I am not sure).

Also it seems that the problem should happen for any directory with size
larger than four 2048-byte sectors (I think that any directory with >
300 files would definitely qualify).

After some code reading and run-time debugging, here are some facts
about udf directory reading:
1. bread-ing is done via device vnode (as opposed to directory vnodes),
as a consequence udf_strategy is not involved in directory reading
2. the device vnode has a non-NULL v_object, this means that VMIO is used
3. it seems that the code assumed that non-VM buffers are used
4. bread is done on as much data as possible, up to MAXBSIZE [and even
slightly more in some cases] (i.e. code determines directory data size
and attempts to read in everything in one go)
5. the physical sector number, converted to DEV_BSIZE (512-byte) sectors, is
passed to bread - this is because of #1 above and peculiarities of the buf
system
6. directory reading and file reading are quite different, because the
latter does reading via file vnode

Let's consider a simple case of "directory reading" 3 * PAGE_SIZE (4K)
bytes starting from a physical sector N with N%2 = 0, from media with a
sector size of 2K (the most common CD/DVD case):
1. we want to read 12 KB = 3 pages = 6 sectors starting from sector N,
N%2 = 0
2. 4*N is passed as a sector number to bread
3. bo_bsize of the corresponding bufobj is a media sector size, i.e. 2K
4. actual read in bread will happen from b_iooffset of 4*N*DEV_BSIZE =
N*2K, i.e. correct offset within the device
5. for a fresh read getblk will be called
6. getblk will set b_offset to blkno*bo_bsize=4*N*2K, i.e. 4 times the
correct byte offset within the device
7. allocbuf will allocate pages using B_VMIO case
8. allocbuf calculates base page index as follows: pi = b_offset/PAGE_SIZE
this means the following:
sectors N and N+1 will be bound to a page with index 4*N*2K/4K = 2*N
sectors N+2 and N+3 will be bound to the next page 2*N + 1
sectors N+4 and N+5 will be bound to the next page 2*N + 2

Now let's consider a "directory read" of 2 sectors (1 page) starting at
physical sector N+1.
Using the same calculations as above, we see that the sector will be
tied to a page 2*(N+1) = 2*N + 2.
And this page already contains valid cached data from the previous read,
*but* it contains data from sectors N+4 and N+5 instead of N+1 and N+2.

So we see that, because of b_offset being 4 times the actual byte
offset, we get incorrect page indexing. Namely, sector X gets associated
with different pages depending on what sector is used as a starting
sector for bread. If bread starts at sector N, then data of sector N+1
is associated with (second half of) page with index 2*N; but if bread
starts at sector N+1 then it gets associated with (the first half of)
page with index 2*(N+1).

Previously, before VMIO change, data for all reads was put into separate
buffers that did not share anything between them. So the problem was
limited only to wasting memory with duplicate data, but no actual
over-runs did happen. Now we have the over-runs because the VM pages are
shared between the buffers of the same vnode.

One obvious solution is to limit bread size to 2*PAGE_SIZE = 4 *
sector_size. In this case, as before, we would waste some memory on
duplicate data but we would avoid page overruns. If we limit bread size
even more, to 1 sector, then we would not have any duplicate data at
all. But there would still be some resource waste - each page would
correspond to one sector, so 4K page would have only 2K of valid data
and the other half in each page is unused.

Another solution, which to me seems to be better, is to do "usual"
reading - pass a directory vnode and 2048-byte sector offset to bread.
In this case udf_strategy will get called for blkno translation, all
offsets and indexes will be correct, and everything will work perfectly,
as is the case for all other filesystems.
The only overhead in this case comes from the need to handle the
UDF_INVALID_BMAP case (where data is embedded into the file entry). So it
means that we have to call bmap_internal to see if we have that special
case and handle it, and if the case is not special we would call bread
on a directory vnode, which means that bmap_internal would be called
again in udf_strategy.
One optimization that can be done in this case is to create a lighter
version of bmap_internal that would merely check for the special case
and wouldn't do anything else.

I am attaching a patch based on the ideas above. It fixes the problem
for me and doesn't seem to create any new ones :-)
About the patch:
hunk #1 fixes a copy/paste
hunk #2 should fix strategy to not set junk b_blkno i