On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
> > On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
> > > On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> > > > Shaohua,
> > > > 
> > > > On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> > > > > Hi,
> > > > > >   We have file readahead to do async file reads, but have no metadata
> > > > > > readahead. For a list of files, their metadata is stored in fragmented
> > > > > > disk space and metadata read is a sync operation, which greatly hurts
> > > > > > the efficiency of readahead. The patches try to add metadata readahead
> > > > > > for btrfs.
> > > > > >   In btrfs, metadata is stored in btree_inode. Ideally, we could hook
> > > > > > the inode to a fd so we could use existing syscalls (readahead, mincore
> > > > > > or upcoming fincore) to do readahead, but the inode is hidden, and there
> > > > > > is no easy way to do this from my understanding. So we add two ioctls for
> > > > 
> > > > If that is the main obstacle, why not do straightforward fincore()/
> > > > fadvise(), and add ioctls to btrfs to export/grab the hidden
> > > > btree_inode in any form?  This will address btrfs' specific issue, and
> > > > have the benefit of making the VFS part general enough. You know
> > > > ext2/3/4 already have block_dev ready for metadata readahead.
> > > > I forgot to update this comment. Please see patch 2 and patch 4: both
> > > > incore and readahead need btrfs-specific stuff involved, so we can't use
> > > > generic fincore or something.
> > 
> > You can if you like :)
> > 
> > - fincore() can return the referenced bit, which is generally
> >   useful information
> Metadata pages in ext2/3 don't have the referenced bit set, while btrfs's do.
> We can't blindly filter out such pages using that bit.

block_dev inodes do have the accessed bits set. Look at the output below.

/dev/sda5 is a mounted ext4 partition.  The 'A'/'R' in the
dump_page_cache lines stand for Active/Referenced.

r...@bay /home/wfg# echo /dev/sda5 > /debug/tracing/objects/mm/pages/dump-file
r...@bay /home/wfg# cat /debug/tracing/trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
             zsh-2950  [003]   879.500764: dump_inode_cache:            0  55643986944      1703936        21879 D___  BLK            mount /dev/sda5
             zsh-2950  [003]   879.500774: dump_page_cache:            0      2 ___AR_____P    2    0
             zsh-2950  [003]   879.500776: dump_page_cache:            2      3 ____R_____P    2    0
             zsh-2950  [003]   879.500777: dump_page_cache:         1026      5 ___AR_____P    2    0
             zsh-2950  [003]   879.500778: dump_page_cache:         1031      3 ___A______P    2    0
             zsh-2950  [003]   879.500779: dump_page_cache:         1034      1 ___AR_____P    2    0
             zsh-2950  [003]   879.500780: dump_page_cache:         1035      2 ___A______P    2    0
             zsh-2950  [003]   879.500781: dump_page_cache:         1037      1 ___AR_____P    2    0
             zsh-2950  [003]   879.500782: dump_page_cache:         1038      3 ____R_____P    2    0
             zsh-2950  [003]   879.500782: dump_page_cache:         1041      1 ___A______P    2    0
             zsh-2950  [003]   879.500783: dump_page_cache:         1057      1 ___AR_D___P    2    0
             zsh-2950  [003]   879.500788: dump_page_cache:         1058      6 ___A______P    2    0
             zsh-2950  [003]   879.500788: dump_page_cache:         9249      1 ___AR_____P    2    0
             zsh-2950  [003]   879.500789: dump_page_cache:       524289      1 ____R_____P    2    0
             zsh-2950  [003]   879.500790: dump_page_cache:       524290      2 ___A______P    2    0
             zsh-2950  [003]   879.500790: dump_page_cache:       524292      1 ___AR_____P    2    0
             zsh-2950  [003]   879.500791: dump_page_cache:       524293      1 ___A______P    2    0
             zsh-2950  [003]   879.500796: dump_page_cache:       524294      9 ____R_____P    2    0
             zsh-2950  [003]   879.500797: dump_page_cache:       524303      1 ___A______P    2    0
             zsh-2950  [003]   879.500798: dump_page_cache:       987136      1 ___AR_____P    2    0
             zsh-2950  [003]   879.500798: dump_page_cache:      1048576      1 ____R_____P    2    0
             zsh-2950  [003]   879.500799: dump_page_cache:      1048577      2 ___A______P    2    0
             zsh-2950  [003]   879.500800: dump_page_cache:      1048579      1 ___AR_____P    2    0
             zsh-2950  [003]   879.500801: dump_page_cache:      1048580      5 ___A______P    2    0
             zsh-2950  [003]   879.500802: dump_page_cache:      1048585      1 ___AR_____P    2    0
             zsh-2950  [003]   879.500805: dump_page_cache:      1048586      5 ___A______P    2    0
             zsh-2950  [003]   879.500805: dump_page_cache:      1048591      1 ___AR_____P    2    0
             zsh-2950  [003]   879.500806: dump_page_cache:      1572864      1 ____R_____P    2    0
             zsh-2950  [003]   879.500807: dump_page_cache:      1572865      5 ___A______P    2    0
             zsh-2950  [003]   879.500808: dump_page_cache:      1572870      1 ___AR_____P    2    0
             zsh-2950  [003]   879.500811: dump_page_cache:      1572871      6 ___A______P    2    0
             zsh-2950  [003]   879.500812: dump_page_cache:      1572877      3 ____R_____P    2    0
             zsh-2950  [003]   879.500816: dump_page_cache:      2097153      8 ____R_____P    2    0
             zsh-2950  [003]   879.500817: dump_page_cache:      2097161      1 ___A______P    2    0
             zsh-2950  [003]   879.500818: dump_page_cache:      2097162      4 ____R_____P    2    0
             zsh-2950  [003]   879.500819: dump_page_cache:      6324224      1 ____R_D___P    2    0
             zsh-2950  [003]   879.500820: dump_page_cache:      6324225      3 ___AR_____P    2    0
             zsh-2950  [003]   879.500825: dump_page_cache:      6324228     29 ___A______P    2    0
             zsh-2950  [003]   879.500826: dump_page_cache:      6324257      1 ____R_____P    2    0
             zsh-2950  [003]   879.500828: dump_page_cache:      6324258      4 ___A______P    2    0
             zsh-2950  [003]   879.500830: dump_page_cache:      6324262     11 ____R_____P    2    0
             zsh-2950  [003]   879.500833: dump_page_cache:      6324273     16 ___AR_____P    2    0
             zsh-2950  [003]   879.500833: dump_page_cache:      6324289      1 ___A______P    2    0
             zsh-2950  [003]   879.500834: dump_page_cache:      6324290      2 ___AR_____P    2    0
             zsh-2950  [003]   879.500835: dump_page_cache:      6324292      8 ___A______P    2    0
             zsh-2950  [003]   879.500836: dump_page_cache:      6324300      2 ___AR_____P    2    0
             zsh-2950  [003]   879.500837: dump_page_cache:      6324302      3 ___A______P    2    0
             zsh-2950  [003]   879.500838: dump_page_cache:      6324305      4 ____R_____P    2    0
             zsh-2950  [003]   879.500843: dump_page_cache:      6324309     28 ___AR_____P    2    0
             zsh-2950  [003]   879.500844: dump_page_cache:      6324337      4 ___A______P    2    0
             zsh-2950  [003]   879.500845: dump_page_cache:      6324341      2 ____R_____P    2    0
             zsh-2950  [003]   879.500850: dump_page_cache:      6324343     30 ___AR_____P    2    0
             zsh-2950  [003]   879.500851: dump_page_cache:      6324373      2 ___A______P    2    0
             zsh-2950  [003]   879.500852: dump_page_cache:      6324375      2 ___AR_____P    2    0
             zsh-2950  [003]   879.500853: dump_page_cache:      6324377      9 ___A______P    2    0
             zsh-2950  [003]   879.500854: dump_page_cache:      6324386      2 ___AR_____P    2    0
             zsh-2950  [003]   879.500855: dump_page_cache:      6324388      5 ___A______P    2    0
             zsh-2950  [003]   879.500856: dump_page_cache:      6324393      3 ___AR_____P    2    0
             zsh-2950  [003]   879.500858: dump_page_cache:      6324396     11 ___A______P    2    0
             zsh-2950  [003]   879.500859: dump_page_cache:      6324407      1 ____R_____P    2    0
             zsh-2950  [003]   879.500864: dump_page_cache:      6324408     31 ___AR_____P    2    0
             zsh-2950  [003]   879.500864: dump_page_cache:      6324439      1 ___A______P    2    0
             zsh-2950  [003]   879.500865: dump_page_cache:      6324440      1 ____R_____P    2    0
             zsh-2950  [003]   879.500866: dump_page_cache:      6324441      2 ___A______P    2    0
             zsh-2950  [003]   879.500867: dump_page_cache:      6324443      5 ____R_____P    2    0
             zsh-2950  [003]   879.500872: dump_page_cache:      6324448     26 ___AR_____P    2    0
             zsh-2950  [003]   879.500873: dump_page_cache:      6324474      6 ___A______P    2    0
             zsh-2950  [003]   879.500874: dump_page_cache:      6324480      4 ____R_____P    2    0
             zsh-2950  [003]   879.500879: dump_page_cache:      6324484     28 ___AR_____P    2    0
             zsh-2950  [003]   879.500880: dump_page_cache:      6324512      4 ___A______P    2    0
             zsh-2950  [003]   879.500881: dump_page_cache:      6324516      1 ____R_____P    2    0
             zsh-2950  [003]   879.500881: dump_page_cache:      6324517      1 ___A______P    2    0
             zsh-2950  [003]   879.500882: dump_page_cache:      6324518      2 ___AR_____P    2    0
             zsh-2950  [003]   879.500888: dump_page_cache:      6324520     28 ___A______P    2    0
             zsh-2950  [003]   879.500890: dump_page_cache:      6324548      2 ____R_____P    2    0

> fincore can take a parameter or return a bit to distinguish
> referenced pages, but I don't think it's a good API. This should be
> transparent to userspace.

Users who care about the "cached" status may well be interested in the
"active/referenced" status too; the two are correlated. fincore()
won't be a simple replica of mincore() anyway, because it has to
deal with huge, sparsely accessed files. The accessed bits of a file
page are normally more meaningful than the accessed bits of mapped
(anonymous) pages.

Another option may be to use the above
/debug/tracing/objects/mm/pages/dump-file interface.

> > - btrfs_metadata_readahead() can be passed to some (faked)
> >   ->readpages() for use with fadvise.
> This needs a filesystem-specific hook too; the difference is that your
> proposal uses fadvise while I'm using an ioctl. There isn't a big difference.

True for btrfs. However, it makes a big difference for other file systems.
 
> BTW, it's hard to hook btrfs_inode to a fd even with an ioctl, at least I
> didn't find an easy way to do this. It might be possible, for example by
> adding a fake device or fake fs (anon_inode doesn't work here, IIRC),
> which is a bit ugly. Until it's proven that a generic API can handle
> metadata readahead, I don't want to do it.

Right, it could be hard to export btrfs_inode. I'm glad you brought it
up. If we cannot make it work, it's valuable to point out the problem and
let everyone know the root cause that made us turn to an ioctl-based
workaround. Then others will understand the design choices and, if we
are lucky, join us and help export the btrfs_inode.
 
Thanks,
Fengguang

> > > > > > this. One is like the readahead syscall, the other is like the
> > > > > > mincore/fincore syscall.
> > > > > >   On a harddisk-based netbook with Meego, the metadata readahead
> > > > > > cut boot time by about 3.5s on average, out of 16s total.
> > > > > >   Last time I posted similar patches to the btrfs mailing list, which
> > > > > > added the new ioctls in btrfs-specific ioctl code. But Christoph Hellwig
> > > > > > asked that we have a generic interface so other filesystems can share
> > > > > > some code, so I came up with the new one. Comments and suggestions are
> > > > > > welcome!
> > > > > 
> > > > > v1->v2:
> > > > > > 1. Added more comments and fixed return values as suggested by Andrew Morton
> > > > > > 2. Fixed a race condition pointed out by Yan Zheng
> > > > > 
> > > > > initial post:
> > > > > http://marc.info/?l=linux-fsdevel&m=129222493406353&w=2
> > > > > 
> > > > > Thanks,
> > > > > Shaohua
> > > > > 
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe 
> > > > > linux-fsdevel" in
> > > > > the body of a message to majord...@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> 
> 