Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Neil Brown writes:

> [...]
> Thus the general sequence might be:
>
>  a/ issue all "preceding writes".
>  b/ issue the commit write with BIO_RW_BARRIER
>  c/ wait for the commit to complete.
>     If it was successful - done.
>     If it failed other than with EOPNOTSUPP, abort
>     else continue
>  d/ wait for all 'preceding writes' to complete
>  e/ call blkdev_issue_flush
>  f/ issue commit write without BIO_RW_BARRIER
>  g/ wait for commit write to complete
>     if it failed, abort
>  h/ call blkdev_issue
>  DONE
>
> steps b and c can be left out if it is known that the device does not
> support barriers. The only way to discover this is to try and see if it
> fails.
>
> I don't think any filesystem follows all these steps.

It seems that steps b/ -- h/ are quite generic, and can be implemented once in generic code (with some synchronization mechanism, like a wait-queue, at d/).

[...]

> Thank you for your attention.
>
> NeilBrown

Nikita.

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] TileFS - a proposal for scalable integrity checking
Valerie Henson writes:

> [...]
> You're right about needing to read the equivalent data-structure - for
> other reasons, each continuation inode will need an easily accessible
> list of byte ranges covered by that inode. (Sounds like, hey,
> extents!) The important part is that you don't have to go walk all the
> indirect blocks or check your bitmap.
>
> -VAL

I see. I was under the impression that the idea was to use the indirect blocks themselves as that data-structure, e.g., block number 0 to mark holes, block number 1 to mark "block not in this continuation", and all other block numbers for real blocks.

Nikita.
Re: [RFC] TileFS - a proposal for scalable integrity checking
Valerie Henson writes:

> [...]
> Hm, I'm not sure that everyone understands a particular subtlety of
> how the fsck algorithm works in chunkfs. A lot of people seem to
> think that you need to check *all* cross-chunk links, every time an
> individual chunk is checked. That's not the case; you only need to
> check the links that go into and out of the dirty chunk. You also
> don't need to check the other parts of the file outside the chunk,
> except for perhaps reading the byte range info for each continuation
> node and making sure no two continuation inodes think they both have
> the same range, but you don't check the indirect blocks, block
> bitmaps, etc.

I guess I am missing something. If chunkfs maintains the "at most one continuation per chunk" invariant, then a continuation inode might end up with multiple byte ranges, and to check that they do not overlap one has to read its indirect blocks (or some equivalent data-structure).

Nikita.
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
David Lang writes:

> On Tue, 24 Apr 2007, Nikita Danilov wrote:
>
> > [...]
> >
> > Maybe I failed to describe the problem precisely.
> >
> > Suppose that all chunks have been checked. After that, for every inode
> > I0 having continuations I1, I2, ... In, one has to check that every
> > logical block is presented in at most one of these inodes. For this one
> > has to read I0, with all its indirect (double-indirect, triple-indirect)
> > blocks, then read I1 with all its indirect blocks, etc. And to repeat
> > this for every inode with continuations.
> >
> > In the worst case (every inode has a continuation in every chunk) this
> > obviously is as bad as un-chunked fsck. But even in the average case,
> > the total amount of IO necessary for this operation is proportional to
> > the _total_ file system size, rather than to the chunk size.
>
> actually, it should be proportional to the number of continuation nodes. The
> expectation (and design) is that they are rare.

Indeed, but the total size of the meta-data pertaining to all continuation inodes is still proportional to the total file system size, and so is fsck time: O(total_file_system_size). What is more important, the design puts (as far as I can see) no upper limit on the number of continuation inodes, and hence, even if the _average_ fsck time is greatly reduced, occasionally fsck can take more time than on an ext2 of the same size. This is clearly unacceptable in many situations (HA, etc.).

Nikita.
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
David Lang writes:

> On Tue, 24 Apr 2007, Nikita Danilov wrote:
>
> > Amit Gud writes:
> >
> > > Hello,
> > >
> > > This is an initial implementation of ChunkFS technique, briefly discussed
> > > at: http://lwn.net/Articles/190222 and
> > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> >
> > I have a couple of questions about the chunkfs repair process.
> >
> > First, as I understand it, each continuation inode is a sparse file,
> > mapping some subset of logical file blocks into block numbers. [...]
>
> not quite.
>
> this checking is a O(n^2) or worse problem, and it can eat a lot of memory in
> the process. with chunkfs you divide the problem by a large constant (100 or
> more) for the checks of individual chunks. after those are done then the final
> pass checking the cross-chunk links doesn't have to keep track of everything, it
> only needs to check those links and what they point to
>
> any ability to mark a filesystem as 'clean' and then not have to check it on
> reboot is a bonus on top of this.
>
> David Lang

Maybe I failed to describe the problem precisely.

Suppose that all chunks have been checked. After that, for every inode I0 having continuations I1, I2, ... In, one has to check that every logical block is presented in at most one of these inodes. For this one has to read I0, with all its indirect (double-indirect, triple-indirect) blocks, then read I1 with all its indirect blocks, etc. And to repeat this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this obviously is as bad as un-chunked fsck. But even in the average case, the total amount of IO necessary for this operation is proportional to the _total_ file system size, rather than to the chunk size.

Nikita.
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Amit Gud writes:

> Hello,
>
> This is an initial implementation of ChunkFS technique, briefly discussed
> at: http://lwn.net/Articles/190222 and
> http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf

I have a couple of questions about the chunkfs repair process.

First, as I understand it, each continuation inode is a sparse file, mapping some subset of logical file blocks into block numbers. Then it seems that during the "final phase" fsck has to check that these partial mappings are consistent, for example, that no two different continuation inodes for a given file contain a block number for the same offset. This check requires a scan of all chunks (rather than of only those "active during the crash"), which seems to return us back to the scalability problem chunkfs tries to address.

Second, it is not clear how, under the assumption of bugs in the file system code (which the paper makes at the very beginning), fsck can limit itself only to the chunks that were active at the moment of the crash.

[...]

> Best,
> AG

Nikita.
Re: Finding hardlinks
Mikulas Patocka writes:

> > > BTW. How does ReiserFS find that a given inode number (or object ID in
> > > ReiserFS terminology) is free before assigning it to new file/directory?
> >
> > reiserfs v3 has an extent map of free object identifiers in
> > super-block.
>
> Inode free space can have at most 2^31 extents --- if inode numbers
> alternate between "allocated", "free". How do you pack it to superblock?

In the worst case, when free/used extents are small, some free oids are "leaked", but this has never been a problem in practice. In fact, there was a patch for reiserfs v3 to store this map in a special hidden file, but it wasn't included in mainline, as nobody ever complained about oid map fragmentation.

> > reiser4 used 64 bit object identifiers without reuse.
>
> So you are going to hit the same problem as I did with SpadFS --- you
> can't export 64-bit inode number to userspace (programs without
> -D_FILE_OFFSET_BITS=64 will have stat() randomly failing with EOVERFLOW
> then) and if you export only 32-bit number, it will eventually wrap-around
> and colliding st_ino will cause data corruption with many userspace
> programs.

Indeed, this is a fundamental problem. Reiser4 tries to ameliorate it by using a hash function that starts colliding only when there are billions of files, in which case a 32-bit inode number is screwed anyway.

Note that none of the above problems invalidates the reasons for having long in-kernel inode identifiers that I outlined in another message.

> Mikulas

Nikita.
Re: Finding hardlinks
Mikulas Patocka writes:

> [...]
> BTW. How does ReiserFS find that a given inode number (or object ID in
> ReiserFS terminology) is free before assigning it to new file/directory?

reiserfs v3 has an extent map of free object identifiers in the super-block. reiser4 used 64-bit object identifiers without reuse.

> Mikulas

Nikita.
Re: Finding hardlinks
Mikulas Patocka writes:

> On Fri, 29 Dec 2006, Trond Myklebust wrote:
>
> > On Thu, 2006-12-28 at 19:14 +0100, Mikulas Patocka wrote:
> >> Why don't you rip off the support for colliding inode number from the
> >> kernel at all (i.e. remove iget5_locked)?
> >>
> >> It's reasonable to have either no support for colliding ino_t or full
> >> support for that (including syscalls that userspace can use to work with
> >> such filesystem) --- but I don't see any point in having half-way support
> >> in kernel as is right now.
> >
> > What would ino_t have to do with inode numbers? It is only used as a
> > hash table lookup. The inode number is set in the ->getattr() callback.
>
> The question is: why does the kernel contain the iget5 function that looks up
> according to a callback, if the filesystem cannot have more than a 64-bit
> inode identifier?

Generally speaking, a file system might have two different identifiers for files:

- one that makes it easy to tell whether two files are the same one;

- one that makes it easy to locate the file on the storage.

According to POSIX, the inode number should always work as an identifier of the first class, but not necessarily as one of the second. For example, in reiserfs something called "a key" is used to locate the on-disk inode, which, in turn, contains the inode number.

Identifiers of the second class tend to live in directory entries, and during lookup we want to consult the inode cache _before_ reading the inode from the disk (otherwise the cache is mostly useless), right? This means that some file systems want to index inodes in a cache by something different than the inode number.

There is another reason why I, personally, would like to have the ability to index inodes by things other than inode numbers: delayed inode number allocation. Strictly speaking, a file system has to assign an inode number to a file only when it is just about to report it to user space (either through stat, or, ugh... readdir). If the location of an inode on disk depends on its inode number (as it does in inode-table based file systems like ext[23]) then delayed inode number allocation has the same advantages as delayed block allocation.

> This lookup callback just induces writing bad filesystems with colliding
> inode numbers. Either remove coda, smb (and possibly other) filesystems
> from the kernel or make proper support for userspace for them.
>
> The situation is that current coreutils 6.7 fail to recursively copy
> directories if some two directories in the tree have colliding inode
> number, so you get random data corruption with these filesystems.
>
> Mikulas

Nikita.
Re: NFSv4/pNFS possible POSIX I/O API standards
Christoph Hellwig writes:

> I'd like to Cc Ulrich Drepper in this thread because he's going to decide
> what APIs will be exposed at the C library level in the end, and he also
> has quite a lot of experience with the various standardization bodies.
>
> Ulrich, this is in reply to these API proposals:
>
> http://www.opengroup.org/platform/hecewg/uploads/40/10903/posix_io_readdir+.pdf
> http://www.opengroup.org/platform/hecewg/uploads/40/10898/POSIX-stat-manpages.pdf

What is readdirplus() supposed to return in the ->d_stat field for a name "foo" in directory "bar" when "bar/foo" is a mount-point? Note that in the case of a distributed file system, the server has no idea about client mount-points, which implies some form of local post-processing.

Nikita.
Re: readdir behaviour
Jan Blunck writes:

> This was also a topic on lkml 2 weeks ago.
>
> Zitat von Tomas Hruby <[EMAIL PROTECTED]>:
>
> > First of all I would like to know what exactly is the meaning of the
> > 'offset' parameter of filldir and whether it is used somewhere? Unlike
> > ext2, our directories are not easily read sequentially and this value
> > (copied by filldir to dirent->d_off) seems to be quite useless outside
> > our fs code.
>
> The offset of the dirent has no common meaning. Think of it as a cookie or
> something like that. It should not be interpreted either by the VFS or by
> user-space.

->d_off is remembered by glibc, and returned to the user as the result of telldir(3). As such it is a valid argument for a following seekdir(3).

> > Related question is what is the correct behaviour of readdir in case
> > of user's seeking in the directory? If I understand correctly, in case
> > of ext3 (indexed directories), when seeking is detected, readdir
> > starts reading from the directory beginning again.
>
> On different archs the libc is seeking (to d_off) after each call to
> getdents(). Therefore the implementation should honor it.
>
> > The last is about concurrency. How is the problem solved when a directory
> > is read by readdir and between two readdir calls the same directory is
> > changed?

The Single UNIX Specification (http://www.opengroup.org/onlinepubs/007904875/functions/readdir.html) is vague about whether directory entries added asynchronously should be returned.

> It is the filesystem's duty to seek to the next valid dentry. Although it is
> not defined if the new directory contents is returned or the one of
> opendir().
>
> Although I think it would be nice (and convenient to the "everything is a file"
> paradigm) when the directory is presented like a sequential file this is not
> the common practice. Due to the fact that there are no applications which are
> reading and seeking the directories directly this is a good tradeoff to achieve
> high performance for readdir().

Unfortunately, seekdir and telldir are standard (albeit optional) interfaces, and libc translates seekdir into lseek. File systems have to support this.

> Jan

Nikita.
Re: Lazy block allocation and block_prepare_write?
Mingming Cao <[EMAIL PROTECTED]> writes:

> On Tue, 2005-04-19 at 19:55 +0400, Nikita Danilov wrote:
>
> > [...]
> >
> > Just keep in mind that filesystem != ext3. :-) Generic support makes
> > sense only when it is usable by multiple file systems. This is not
> > always possible, e.g., there is no "generic block allocator" for
> > precisely the same reason: disk space allocation policies are tightly
> > intertwined with the rest of file system internals.
>
> This generic support should be useful for ext2 and xfs. From delayed
> [...]

But it won't work for reiser4, which allocates blocks _across_ multiple files. E.g., if many files were created in the same directory, allocation (performed just before write-out) will assign block numbers so that the files are ordered on disk according to the readdir order (with each file body being an interval in that ordering). This is done by arranging all dirty blocks of a given transaction according to some "ideal" ordering and then trying to map this ordering onto disk blocks.

As you see, in this case allocation is not done on an inode-by-inode basis at all: instead, delayed allocation is done at the transaction level of granularity, and I am trying to point out that this is the natural thing for a journalled file system to do.

The same goes for write-out: in ext3 there is only one "active" transaction at any moment, and this means that ->writepages() calls can go in arbitrary order, but for a file system type with multiple active transactions that can be committed separately, the order of ->writepages() calls has to follow the ordering between transactions. Again, this means that write-out should be transaction rather than inode based.

If we want really generic support for journalling and delayed allocation, mpage_* functions are the wrong level. Instead, a proper notion of transaction has to be introduced, and file system IO and disk space allocation interfaces adjusted appropriately.

Nikita.
Re: Lazy block allocation and block_prepare_write?
Badari Pulavarty <[EMAIL PROTECTED]> writes:

> On Tue, 2005-04-19 at 04:22, Nikita Danilov wrote:
>
> > [...]
> >
> > As you most likely already know, Alex Thomas already implemented delayed
> > block allocation for ext3.
>
> Yep. I reviewed Alex Thomas's patches for delayed allocation. He handled
> all the cases in his code and did NOT use any mpage* routines to do
> the work. I was hoping to change the mpage infrastructure to handle
> these, so that every filesystem doesn't have to do their own thing.

Just keep in mind that filesystem != ext3. :-) Generic support makes sense only when it is usable by multiple file systems. This is not always possible, e.g., there is no "generic block allocator" for precisely the same reason: disk space allocation policies are tightly intertwined with the rest of file system internals.

> > > In order to do the correct accounting, we need to mark a page
> > > to indicate if we reserved a block or not. One way to do this is
> > > to use page->private to indicate this. But then, all the generic
> >
> > I believe one can use the PG_mappedtodisk bit in page->flags for this
> > purpose. There was an old Andrew Morton patch that introduced a new bit
> > (PG_delalloc?) for this purpose.
>
> That would be good. But I don't feel like asking for a bit in the page
> if there is a way to get around it.

Clarification: PG_mappedtodisk is already here; it seems you can reuse this already existing bit to implement delayed allocation support.

> [...]
>
> Need to think some more. I guess you thought about this more than I do :)
>
> Thanks,
> Badari

Nikita.
Re: Lazy block allocation and block_prepare_write?
Badari Pulavarty <[EMAIL PROTECTED]> writes:

> [...]
>
> Yes. It's possible to do what you want to. I am currently working on
> adding "delayed allocation" support to ext3.

As you most likely already know, Alex Thomas already implemented delayed block allocation for ext3.

> [...]
>
> In order to do the correct accounting, we need to mark a page
> to indicate if we reserved a block or not. One way to do this is
> to use page->private to indicate this.

I believe one can use the PG_mappedtodisk bit in page->flags for this purpose. There was an old Andrew Morton patch that introduced a new bit (PG_delalloc?) for this purpose.

> But then, all the generic
> routines will fail - since they assume that page->private represents
> bufferheads. So we need a better way to do this.

They are not generic then. Some file systems store things completely different from a buffer head ring in page->private.

> 3) We need to add hooks into filesystem specific calls from these
> generic routines to handle "journaling mode" requirements
> (for ext3 and maybe others).

Please don't. There is no such thing as "generic journalling". The traditional WAL used by ext3, the phase-trees of Tux2, and the wandering logs of reiser4 are so different that there is no hope for a single API to accommodate them all. Adding such an API will only force more workarounds and hacks into non-ext3 file systems.

What _is_ common to all journalling file systems, on the other hand, is the notion of transaction as the natural unit of caching and write-out. Currently in Linux, write-out is inode-based (->writepages()). Reiser4 already has a patch that replaces the sync_sb_inodes() function with a super-block operation. In reiser4's case, this operation scans the list of transactions (instead of the list of inodes) and writes some of them out, which is the natural thing to do for a journalled file system.

Similarly, a transaction is a unit of caching: it's often necessary to scan all pages of a given transaction, all dirty pages of a given transaction, or to check whether a given page belongs to a given transaction. That is, a transaction plays a role similar to struct address_space. But currently there is a 1-to-1 relation between inodes and address_spaces, and this forces the file system to implement additional data structures that duplicate functionality already present in address_space.

> So, what are your requirements ? I am looking for a common
> way to combine all the requirements and come out with
> saner "generic" routines to handle these.

I think that one reasonable way to add generic support for journalling is to split struct address_space into two objects: a lower layer that represents a "file" (say, struct vm_file), in which pages are linearly ordered, and on top of this a vm_cache (representing a transaction) that keeps track of pages from various vm_file's. vm_file is embedded into the inode, and vm_cache has a pointer to (the analog of) struct address_space_operations. vm_cache's are created by the file system back-end as necessary (and can be embedded into the inode for non-journalled file systems). The VM scanner and balance_dirty_pages() call vm_cache operations to do write-out.

> Thanks,
> Badari

Nikita.
Re: Lilo requirements (Was: Re: Address space operations questions)
Martin Jambor writes:

> Thanks for your reply, I found the following thing interesting on its
> own:
>
> On 4/7/05, Nikita Danilov <[EMAIL PROTECTED]> wrote:
>
> > Consider tools like LILO that want stable block numbers for certain
> > files. In reiserfs (both v3 and v4) there is an ioctl that disables
> > relocation for a given file. Besides, I do not think ->bmap() is useless
> > even when block numbers are volatile, for one thing it allows user level
> > to track how a file is laid out (for example, to measure fragmentation).
>
> I tried to google out what behaviour lilo requires filesystems to
> exhibit without much success... is that information available
> somewhere I didn't look? Is it simple enough to be explained here?

As opposed to, say, GRUB, LILO doesn't parse the file system layout at boot time. Instead it remembers in what blocks the kernel image is stored. This assumes the following properties of the file system:

- the unit of disk space allocation for the kernel image file is the block. That is, optimizations like UFS fragments or reiserfs tails are not applied, and

- the blocks that the kernel image is stored in are real disk blocks (i.e., there is a way to disable "delayed allocation"), and

- the kernel image file is not relocated, i.e., data are not moved into other blocks on the fly.

Currently the only file system that doesn't satisfy these requirements is reiserfs, and it has a special ioctl, REISERFS_IOC_UNPACK, that forces LILO-friendly behaviour for a specified file: no tails, no delayed allocation, and no relocation. LILO detects when the kernel image is on reiserfs and calls that ioctl.

> TIA
>
> Martin

Nikita.
Re: Address space operations questions
Martin Jambor writes:

> Thank you very much for your reply.
>
> On Mar 30, 2005 3:55 PM, Nikita Danilov <[EMAIL PROTECTED]> wrote:
>
> > > 1. What is bmap for and what is it supposed to do?
> >
> > ->bmap() maps a logical block offset within an "object" to a physical block
> > number. It is used in a few places, notably in the implementation of the
> > FIBMAP ioctl.
>
> We are about to start implementing a fs where data can move around the
> device and so a physical block address is not really useful. I have
> understood from other postings to this list that reiserfs and ntfs
> don't implement this method so I suppose we'll do the same. I'll just
> find some nice error to return.

Consider tools like LILO that want stable block numbers for certain files. In reiserfs (both v3 and v4) there is an ioctl that disables relocation for a given file. Besides, I do not think ->bmap() is useless even when block numbers are volatile; for one thing, it allows user level to track how a file is laid out (for example, to measure fragmentation).

> [...]
>
> OK, so if I understand it well, sync_page does not actually write the
> page anywhere, it only waits until the device driver finishes all
> previous requests with that page, right? Does block_sync_page do
> exactly that? (I would read the source but all it does is that it
> calls a callback function) BTW, does it wait also for metadata?

No. ->sync_page() doesn't wait for anything. It simply tells the underlying storage layer "start executing all queued IO requests". If your file system uses a block device as its storage, use block_sync_page as your ->sync_page() method.

As for metadata: there is no difference between data and meta-data at this level.

> Martin

Nikita.
Re: [RFC] Add support for semaphore-like structure with support for asynchronous I/O
Trond Myklebust writes:

> to den 31.03.2005 Klokka 12:02 (+0400) skreiv Nikita Danilov:
>
> > As I understand it, in the blocking path IOSEM_LOCK_EXCLUSIVE is set by
> > iosem_lock_wake_function() called by the waker thread. But this is
> > asking for convoy formation: iosem_unlock() transfers ownership of the
> > lock to the thread that is currently sleeping. This means that all
> > threads _running_ on other processors and bumping into that lock will
> > go to sleep too (i.e., the lock is owned but unused), thus forming a
> > "convoy" that has a tendency to grow over time when there is at least
> > the smallest contention. This is a known problem with all "early
> > ownership transfer" lock designs (except that maybe in your case
> > contention is not supposed to happen).
>
> You are assuming that all the waiters on the queue are tasks that must
> sleep if they cannot take the lock. That is not the case here. Whereas
> some users will indeed fall in this category, I expect that most will
> rather want to use the non-blocking mode in which the caller is free to
> go off and do other useful work.

Ah, I see... But then this doesn't look like a semaphore _at_ _all_. Semaphores have no call-backs, and in the iosem case it's the callback (in the form of struct work_struct) that is central to the interface. I believe the naming should reflect this; it's utterly confusing as it is. Maybe struct work_queue_token and schedule_work_{with,end}_token()?

[...]

> Cheers,
> Trond

Nikita.
Re: [RFC] Add support for semaphore-like structure with support for asynchronous I/O
Trond Myklebust writes:

 > In NFSv4 we often want to serialize asynchronous RPC calls with ordinary

[...]

 > +
 > +void fastcall iosem_lock(struct iosem *lk)
 > +{
 > +	struct iosem_wait waiter;
 > +
 > +	might_sleep();
 > +
 > +	init_iosem_waiter(&waiter);
 > +	waiter.wait.func = iosem_lock_wake_function;
 > +
 > +	set_current_state(TASK_UNINTERRUPTIBLE);
 > +	if (__iosem_lock(lk, &waiter))
 > +		schedule();
 > +	__set_current_state(TASK_RUNNING);
 > +
 > +	BUG_ON(!list_empty(&waiter.wait.task_list));
 > +}
 > +EXPORT_SYMBOL(iosem_lock);

As I understand it, in the blocking path IOSEM_LOCK_EXCLUSIVE is set by iosem_lock_wake_function() called by the waker thread. But this is asking for convoy formation: iosem_unlock() transfers ownership of the lock to the thread that is currently sleeping. This means that all threads _running_ on other processors and bumping into that lock will go to sleep too (i.e., the lock is owned but unused), thus forming a "convoy" that has a tendency to grow over time when there is even the smallest contention. This is a known problem with all "early ownership transfer" lock designs (except maybe in your case contention is not supposed to happen).

And as a nitpick: struct iosem is emphatically _not_ a semaphore, it doesn't even have a counter. :) Can it be named iomutex or iolock or async_lock or something? We have enough confusion going on with struct semaphore that is mostly used as a mutex.

[...]

 > +
 > +int fastcall iosem_lock_and_schedule_work(struct iosem *lk, struct iosem_work *wk)
 > +{
 > +	int ret;
 > +
 > +	init_iosem_waiter(&wk->waiter);
 > +	wk->waiter.wait.func = iosem_lock_and_schedule_function;
 > +	ret = __iosem_lock(lk, &wk->waiter);
 > +	if (ret == 0)
 > +		ret = schedule_work(&wk->work);
 > +	return ret;
 > +}

This is actually a trylock, right? If iosem_lock_and_schedule_work() returns -EINPROGRESS, the lock is not acquired on return and the caller has to call schedule().

[...]

 > --
 > Trond Myklebust <[EMAIL PROTECTED]>

Nikita.
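The convoy argument can be illustrated with a toy Python model (our construction, not anything from the patch): after an unlock, a woken sleeper needs a few scheduler ticks to actually start running. Under early ownership transfer the lock stays owned for that whole window, so every running thread that arrives during it must sleep; if unlock merely marked the lock free, the first arrival would take it immediately. The tick values and WAKE_LATENCY are made-up parameters:

```python
WAKE_LATENCY = 3  # ticks until a woken sleeper is actually scheduled (assumed)

def sleepers_with_handoff(arrivals, unlock_time=0):
    """Early ownership transfer: the lock is owned-but-unused during the
    wake-up latency, so every arrival in that window must sleep (convoy)."""
    return sorted(t for t in arrivals
                  if unlock_time < t < unlock_time + WAKE_LATENCY)

def sleepers_with_free_unlock(arrivals, unlock_time=0):
    """Unlock just marks the lock free: in this toy model the first arrival
    in the window wins the race and keeps running; only later ones sleep."""
    window = sorted(t for t in arrivals
                    if unlock_time < t < unlock_time + WAKE_LATENCY)
    return window[1:]

arrivals = [1, 2]          # two running threads hit the lock before tick 3
print(sleepers_with_handoff(arrivals))      # -> [1, 2]: both join the convoy
print(sleepers_with_free_unlock(arrivals))  # -> [2]: tick-1 thread took the lock
```

The model ignores hold times and repeated wake-ups, but it shows why the handoff variant accumulates sleepers whenever arrivals outpace the wake-up latency.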
Re: Address space operations questions
Martin Jambor writes: > Hi, > > I have problems understanding the purpose of different entries of > struct address_space_operations in 2.6 kernels: > > 1. What is bmap for and what is it supposed to do? ->bmap() maps logical block offset within "object" to physical block number. It is used in a few places, notably in the implementation of the FIBMAP ioctl. > > 2. What is the difference between sync_page and write_page? (It is spelt ->writepage(), by the way.) ->sync_page() is an awful misnomer. Usually, when a page IO operation is requested by calling ->writepage() or ->readpage(), the file system queues an IO request (e.g., a disk-based file system may do this by calling submit_bio()), but the underlying device driver does not proceed with this IO immediately, because IO scheduling is more efficient when there are multiple requests in the queue. Only when something really wants to wait for IO completion (wait_on_page_{locked,writeback}() are used to wait for read and write completion respectively) is the IO queue processed. To do this, wait_on_page_bit() calls ->sync_page() (see block_sync_page(), the standard implementation of ->sync_page() for disk-based file systems). So, the semantics of ->sync_page() are roughly "kick the underlying storage driver to actually perform all IO queued for this page, and, maybe, for other pages on this device too". > > 3. What exactly (fs independent) is the relation in between > write_page, prepare_write and commit_write? Does prepare make sure a > page can be written (like allocating space), commit mark it dirty, and > write write it sometime later on? ->prepare_write() and ->commit_write() are only used by generic_file_write() (so, one may argue that they shouldn't be placed into struct address_space at all). 
generic_file_write() has a loop over each page overlapping the portion of the file that the write goes into:

	a_ops->prepare_write(file, page, from, to);
	copy_from_user(...);
	a_ops->commit_write(file, page, from, to);

If a page is partially overwritten, ->prepare_write() has to read the parts of the page that are not covered by the write. ->commit_write() is expected to mark the page (or buffers) and the inode dirty, and to update the inode size if the write extends the file. As for block allocation and transaction handling, this is up to the file system back end. Usually ->commit_write() doesn't start IO by itself, it just marks pages dirty. Write-out is done by balance_dirty_pages_ratelimited(): when the number of dirty pages in the system exceeds some threshold, the kernel calls ->writepages() of dirty inodes. ->writepage() is used in two places:

 - by the VM scanner to write out dirty pages from the tail of the inactive list. This is a "rare" path, because balance_dirty_pages() is supposed to keep the amount of dirty pages under control.

 - by mpage_writepages(): the default implementation of the ->writepages() method.

> > Thank you very much for any insight, > > Martin

Hope this helps.

Nikita.
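The per-page walk that this loop performs is easy to model in userspace. Below is a toy Python sketch (assuming 4 KiB pages; the helper name and the list-of-tuples output are ours, not the kernel's) computing which (from, to) byte range of each page-cache page a given write touches:

```python
PAGE_SIZE = 4096

def write_chunks(pos, count):
    """Return (page_index, from_, to) for every page-cache page that a write
    of `count` bytes at file offset `pos` touches -- the walk that
    generic_file_write() performs around ->prepare_write()/->commit_write()."""
    end = pos + count
    chunks = []
    while pos < end:
        index = pos // PAGE_SIZE           # which page-cache page
        from_ = pos % PAGE_SIZE            # first byte written in this page
        to = min(from_ + (end - pos), PAGE_SIZE)  # one past the last byte
        chunks.append((index, from_, to))
        pos += to - from_
    return chunks

# a 200-byte write at offset 4000 straddles pages 0 and 1; page 0 is only
# partially overwritten, so ->prepare_write() would have to read its head
print(write_chunks(4000, 200))  # -> [(0, 4000, 4096), (1, 0, 104)]
```

Any chunk with from_ > 0 or to < PAGE_SIZE is the partial-overwrite case where ->prepare_write() must first read the uncovered parts of the page.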
Re: [RFC] [PATCH] Generic mpage_writepage() support
Badari Pulavarty writes: > On Tue, 2005-02-15 at 09:54, Andrew Morton wrote: > > Badari Pulavarty <[EMAIL PROTECTED]> wrote: > > > > > > Yep. nobh_prepare_write() doesn't add any bufferheads. But > > > we call block_write_full_page() even for "nobh" case, which > > > does create bufferheads, attaches to the page and operates > > > on them.. > > > > hmm, yeah, OK, we'll attach bh's in that case. It's a rare case though - > > when a dirty page falls off the end of the LRU. There's no particular > > reason why we cannot have a real mpage_writepage() which doesn't use bh's > > and employ that. > > > > I coulda sworn we used to have one. > > Hi Andrew, > > Here is my first version of the mpage_writepage() patch. > I haven't handled the "confused" case yet - I need to > pass a function pointer to handle it. Just for > initial code review. I am still testing it. > > Thanks, > Badari

 > diff -Narup -X dontdiff linux-2.6.10/fs/ext2/inode.c linux-2.6.10.nobh/fs/ext2/inode.c
 > --- linux-2.6.10/fs/ext2/inode.c	2004-12-24 13:33:51.0 -0800

[...]

 > 	return ret;
 > }
 > +
 > +/*
 > + * The generic ->writepage function for address_spaces
 > + */

This function doesn't look generic. It only works correctly with file systems that store a pointer to a buffer head ring in page->private (at least temporarily); otherwise the code after the page_has_buffers(page) check in __mpage_writepage() will corrupt page->private.

Actually, this looks confusing. I thought that the main idea of mpage.c was to get rid of buffer heads and switch everything to bios. But looking at the current code it seems that buffer heads are striking back: the code simply assumes that PG_private means "buffers in page->private", making mpage.c effectively useless for file systems using page->private for something else.

There is another reason why mpage_writepage() is a problematic choice for ->writepage: __mpage_writepage() calls page->mapping->a_ops->writepage() in the "confused" case, which sounds like infinite recursion.

[...] 
 > +	if (page->index >= end_index+1 || !offset) {
 > +		/*
 > +		 * The page may have dirty, unmapped buffers. For example,
 > +		 * they may have been added in ext3_writepage(). Make them
 > +		 * freeable here, so the page does not leak.
 > +		 */
 > +		block_invalidatepage(page, 0);

Shouldn't this be page->mapping->a_ops->invalidatepage(page, 0)? To preserve the external appearance of "genericity", that is. :)

 > +		unlock_page(page);
 > +		return 0;	/* don't care */
 > +	}
 > +
 > +	/*
 > +	 * The page straddles i_size. It must be zeroed out on each and every
 > +	 * writepage invokation because it may be mmapped. "A file is mapped

Typo: should be invocation (at least beyond Australia).

Nikita.
Re: Bufferheads & page-cache reference
Andrew Morton <[EMAIL PROTECTED]> writes: > Badari Pulavarty <[EMAIL PROTECTED]> wrote: >> >> Yep. nobh_prepare_write() doesn't add any bufferheads. But >> we call block_write_full_page() even for "nobh" case, which >> does create bufferheads, attaches to the page and operates >> on them.. > > hmm, yeah, OK, we'll attach bh's in that case. It's a rare case though - > when a dirty page falls off the end of the LRU. There's no particular Maybe DB2 dirties pages through mmap? Nikita.