Re: [RFC 2/9] Use NOMEMALLOC reclaim to allow reclaim if PF_MEMALLOC is set

2007-08-23 Thread Nikita Danilov
Peter Zijlstra writes:

[...]

 > My idea is to extend kswapd, run cpus_per_node instances of kswapd per
 > node for each of GFP_KERNEL, GFP_NOFS, GFP_NOIO. (basically 3 kswapds
 > per cpu)
 > 
 > whenever we would hit direct reclaim, add ourselves to a special
 > waitqueue corresponding to the type of GFP and kick all the
 > corresponding kswapds.

There are two standard objections to this:

- direct reclaim was introduced to reduce memory allocation latency,
  and going through the scheduler kills this. But more importantly,

- it might so happen that _all_ per-cpu kswapd instances are
  blocked, e.g., waiting for IO on indirect blocks, or on queue
  congestion. In that case the whole system stalls, waiting for that IO
  to complete. In the direct reclaim case, other threads can continue
  zone scanning.

Nikita.


Re: [PATCH 00/23] per device dirty throttling -v8

2007-08-04 Thread Nikita Danilov
Andrew Morton writes:

[...]

 > 
 > It's pretty much unfixable given the ext3 journalling design, and the
 > guarantees which data-ordered provides.

ZFS has an intent log to handle this
(http://blogs.sun.com/realneel/entry/the_zfs_intent_log). Something like
that could --theoretically-- be added to ext3-style journalling.

Nikita.

 > 
 > The easy preventive is to mount with data=writeback.  Maybe that should
 > have been the default.



Re: How innovative is Linux?

2007-06-24 Thread Nikita Danilov
Alan Cox writes:

[...]

 > 
 > A few innovations that afaik first appeared in the Linux kernel
 > - Making multiple hosts appear transparently as one IP address
 > - Futex fast hybrid locking

DEC Firefly workstation, before 1987.

Nikita.


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-28 Thread Nikita Danilov
Neil Brown writes:
 > 

[...]

 > Thus the general sequence might be:
 > 
 >   a/ issue all "preceding writes".
 >   b/ issue the commit write with BIO_RW_BARRIER
 >   c/ wait for the commit to complete.
 >  If it was successful - done.
 >  If it failed other than with EOPNOTSUPP, abort
 >  else continue
 >   d/ wait for all 'preceding writes' to complete
 >   e/ call blkdev_issue_flush
 >   f/ issue commit write without BIO_RW_BARRIER
 >   g/ wait for commit write to complete
 >if it failed, abort
 >   h/ call blkdev_issue
 >   DONE
 > 
 > steps b and c can be left out if it is known that the device does not
 > support barriers.  The only way to discover this is to try and see if
 > it fails.
 > 
 > I don't think any filesystem follows all these steps.

It seems that steps b/ -- h/ are quite generic, and can be implemented
once in generic code (with some synchronization mechanism, like a
wait-queue, at step d/).
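
For illustration, below is a rough sketch of what such a generic helper
could look like. submit_commit_block(), wait_for_commit() and
wait_for_preceding_writes() are invented placeholders for whatever
mechanism the calling file system already has; only blkdev_issue_flush()
is the real block-layer helper, and error handling is abridged.

/*
 * Sketch only: steps b/ -- h/ factored out of the file system.
 */
static int generic_issue_commit(struct block_device *bdev, void *commit,
				int try_barrier)
{
	int err;

	if (try_barrier) {
		submit_commit_block(commit, 1);		/* b/ with barrier */
		err = wait_for_commit(commit);		/* c/ */
		if (err != -EOPNOTSUPP)
			return err;			/* done, or abort */
	}
	wait_for_preceding_writes();			/* d/ wait-queue here */
	blkdev_issue_flush(bdev, NULL);			/* e/ */
	submit_commit_block(commit, 0);			/* f/ no barrier */
	err = wait_for_commit(commit);			/* g/ */
	if (err)
		return err;				/* abort */
	return blkdev_issue_flush(bdev, NULL);		/* h/ */
}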

Nikita.

[...]

 > 
 > Thank you for your attention.
 > 
 > NeilBrown
 > 



Re: [rfc] lock bitops

2007-05-09 Thread Nikita Danilov
Nick Piggin writes:
 > Hi,

[...]

 >  
 >  /**
 > + * clear_bit_unlock - Clears a bit in memory with release
 > + * @nr: Bit to clear
 > + * @addr: Address to start counting from
 > + *
 > + * clear_bit() is atomic and may not be reordered.  It does

s/clear_bit/clear_bit_unlock/ ?

 > + * contain a memory barrier suitable for unlock type operations.
 > + */
 > +static __inline__ void
 > +clear_bit_unlock (int nr, volatile void *addr)
 > +{

Nikita.


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Nikita Danilov
David Lang writes:
 > On Tue, 24 Apr 2007, Nikita Danilov wrote:
 > 
 > > David Lang writes:
 > > > On Tue, 24 Apr 2007, Nikita Danilov wrote:
 > > >
 > > > > Amit Gud writes:
 > > > >
 > > > > Hello,
 > > > >
 > > > > >
 > > > > > This is an initial implementation of ChunkFS technique, briefly 
 > > > > > discussed
 > > > > > at: http://lwn.net/Articles/190222 and
 > > > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
 > > > >
 > > > > I have a couple of questions about chunkfs repair process.
 > > > >
 > > > > First, as I understand it, each continuation inode is a sparse file,
 > > > > mapping some subset of logical file blocks into block numbers. Then it
 > > > > seems, that during "final phase" fsck has to check that these partial
 > > > > mappings are consistent, for example, that no two different 
 > > > > continuation
 > > > > inodes for a given file contain a block number for the same offset. 
 > > > > This
 > > > > check requires scan of all chunks (rather than of only "active during
 > > > > crash"), which seems to return us back to the scalability problem
 > > > > chunkfs tries to address.
 > > >
 > > > not quite.
 > > >
 > > > this checking is a O(n^2) or worse problem, and it can eat a lot of 
 > > > memory in
 > > > the process. with chunkfs you divide the problem by a large constant 
 > > > (100 or
 > > > more) for the checks of individual chunks. after those are done then the 
 > > > final
 > > > pass checking the cross-chunk links doesn't have to keep track of 
 > > > everything, it
 > > > only needs to check those links and what they point to
 > >
 > > Maybe I failed to describe the problem precisely.
 > >
 > > Suppose that all chunks have been checked. After that, for every inode
 > > I0 having continuations I1, I2, ... In, one has to check that every
 > > logical block is presented in at most one of these inodes. For this one
 > > has to read I0, with all its indirect (double-indirect, triple-indirect)
 > > blocks, then read I1 with all its indirect blocks, etc. And to repeat
 > > this for every inode with continuations.
 > >
 > > In the worst case (every inode has a continuation in every chunk) this
 > > obviously is as bad as un-chunked fsck. But even in the average case,
 > > total amount of io necessary for this operation is proportional to the
 > > _total_ file system size, rather than to the chunk size.
 > 
 > actually, it should be proportional to the number of continuation nodes. The 
 > expectation (and design) is that they are rare.

Indeed, but the total size of meta-data pertaining to all continuation
inodes is still proportional to the total file system size, and so is
fsck time: O(total_file_system_size).

What is more important, the design puts (as far as I can see) no upper
limit on the number of continuation inodes, and hence, even if the
_average_ fsck time is greatly reduced, occasionally fsck can take more
time than on an ext2 file system of the same size. This is clearly
unacceptable in many situations (HA, etc.).

Nikita.


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread Nikita Danilov
David Lang writes:
 > On Tue, 24 Apr 2007, Nikita Danilov wrote:
 > 
 > > Amit Gud writes:
 > >
 > > Hello,
 > >
 > > >
 > > > This is an initial implementation of ChunkFS technique, briefly discussed
 > > > at: http://lwn.net/Articles/190222 and
 > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
 > >
 > > I have a couple of questions about chunkfs repair process.
 > >
 > > First, as I understand it, each continuation inode is a sparse file,
 > > mapping some subset of logical file blocks into block numbers. Then it
 > > seems, that during "final phase" fsck has to check that these partial
 > > mappings are consistent, for example, that no two different continuation
 > > inodes for a given file contain a block number for the same offset. This
 > > check requires scan of all chunks (rather than of only "active during
 > > crash"), which seems to return us back to the scalability problem
 > > chunkfs tries to address.
 > 
 > not quite.
 > 
 > this checking is a O(n^2) or worse problem, and it can eat a lot of memory 
 > in 
 > the process. with chunkfs you divide the problem by a large constant (100 or 
 > more) for the checks of individual chunks. after those are done then the 
 > final 
 > pass checking the cross-chunk links doesn't have to keep track of 
 > everything, it 
 > only needs to check those links and what they point to

Maybe I failed to describe the problem precisely.

Suppose that all chunks have been checked. After that, for every inode
I0 having continuations I1, I2, ... In, one has to check that every
logical block is present in at most one of these inodes. For this, one
has to read I0, with all its indirect (double-indirect, triple-indirect)
blocks, then read I1 with all its indirect blocks, etc., and repeat
this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this
obviously is as bad as un-chunked fsck. But even in the average case,
the total amount of IO necessary for this operation is proportional to
the _total_ file system size, rather than to the chunk size.
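
To make the cost explicit, the final pass amounts to the following
pseudo-code (every identifier here is invented, it is just a restatement
of the paragraph above):

for_each_inode_with_continuations(fs, i0) {
	clear_offset_map(map);
	for_each_continuation(fs, i0, ic) {	/* I0, I1, ... In */
		/*
		 * Walking the block map of ic reads ic itself plus all
		 * of its indirect, double- and triple-indirect blocks.
		 */
		for_each_mapped_block(ic, offset, blocknr) {
			if (offset_already_seen(map, offset))
				report_conflict(i0, ic, offset);
			mark_offset_seen(map, offset);
		}
	}
}
/* IO done: the block maps of all continuation inodes in all chunks. */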

 > 
 > any ability to mark a filesystem as 'clean' and then not have to check it on 
 > reboot is a bonus on top of this.
 > 
 > David Lang

Nikita.


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread Nikita Danilov
Amit Gud writes:

Hello,

 > 
 > This is an initial implementation of ChunkFS technique, briefly discussed
 > at: http://lwn.net/Articles/190222 and 
 > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf

I have a couple of questions about the chunkfs repair process.

First, as I understand it, each continuation inode is a sparse file,
mapping some subset of logical file blocks into block numbers. Then it
seems that during the "final phase" fsck has to check that these partial
mappings are consistent, for example, that no two different continuation
inodes for a given file contain a block number for the same offset. This
check requires a scan of all chunks (rather than of only those "active
during crash"), which seems to bring us back to the scalability problem
chunkfs tries to address.

Second, it is not clear how, under the assumption of bugs in the file
system code (which the paper makes at the very beginning), fsck can
limit itself only to the chunks that were active at the moment of the
crash.

[...]

 > 
 > Best,
 > AG

Nikita.


Re: [PATCH] Introduce a handy list_first_entry macro

2007-04-18 Thread Nikita Danilov
Pavel Emelianov writes:
 > There are many places in the kernel where the construction like
 > 
 >foo = list_entry(head->next, struct foo_struct, list);
 > 
 > are used. 
 > The code might look more descriptive and neat if using the macro
 > 
 >list_first_entry(head, type, member) \
 >  list_entry((head)->next, type, member)

Wouldn't list_next_entry() be a more descriptive name for that?
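
For what it's worth, the renamed macro would be the same one-liner (the
expansion below just mirrors the quoted list_first_entry()):

#define list_next_entry(head, type, member) \
	list_entry((head)->next, type, member)

/* e.g.:  foo = list_next_entry(&foo_list, struct foo_struct, list); */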

Nikita.



Re: ZFS with Linux: An Open Plea

2007-04-15 Thread Nikita Danilov
Ignatich writes:
 > You might want to look at this discussion:
 > http://mail.opensolaris.org/pipermail/zfs-discuss/2007-April/027041.html

The licenses involved cover the file system _code_, rather than the
storage format, which is openly specified. Just stand up and implement a
driver for the zfs format from scratch under whatever license you
want. This is exactly how Linux supports "foreign" file systems (ntfs,
fat, etc.).

Nikita.



Re: [PATCH 12/13] maps: Add /proc/pid/pagemap interface

2007-04-04 Thread Nikita Danilov
Matt Mackall writes:

[...]

 > 
 > Now I could adjust these to only export u64s in some preferred
 > endianness. But given I already need details like the page size to
 > make any sense of it, it seems unnecessary. Also, the PFNs are fairly
 > opaque unless you're attempting to correlate them with /proc/kpagemap.

Alternatively, you can export some meta-data at the beginning of that
file like /proc/profile does.
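
For example (purely illustrative, not what the patch does), a small
fixed header at offset 0 could carry the details a reader otherwise has
to guess:

/* Hypothetical self-describing header; the field choice is made up. */
struct pagemap_header {
	__u32 version;		/* format version of the rest of the file */
	__u32 page_shift;	/* log2(PAGE_SIZE) on this kernel          */
	__u32 entry_size;	/* size of one PFN record, in bytes        */
	__u32 flags;		/* endianness marker, etc.                 */
};

A reader would then parse (or skip) this header before interpreting the
PFN array, much like readers of /proc/profile consume its leading
sample-step word.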

Nikita.



Re: [PATCH 12/13] maps: Add /proc/pid/pagemap interface

2007-04-04 Thread Nikita Danilov
Matt Mackall writes:
 > Add /proc/pid/pagemap interface
 > 
 > This interface provides a mapping for each page in an address space to
 > its physical page frame number, allowing precise determination of what
 > pages are mapped and what pages are shared between processes.

[...]

 >  
 > +#ifdef CONFIG_PROC_PAGEMAP
 > +struct pagemapread {
 > +struct mm_struct *mm;
 > +unsigned long next;
 > +unsigned long *buf;
 > +unsigned long pos;
 > +size_t count;
 > +int index;
 > +char __user *out;
 > +};
 > +
 > +static int flush_pagemap(struct pagemapread *pm)
 > +{
 > +int n = min(pm->count, pm->index * sizeof(unsigned long));
 > +if (copy_to_user(pm->out, pm->buf, n))
 > +return -EFAULT;

This pushes binary data to user space. Wasn't /proc supposed to be
ASCII-based to avoid compatibility problems (e.g., the size of unsigned
long changing, endianness, etc.)?

Nikita.



Re: [rfc][patch] queued spinlocks (i386)

2007-03-24 Thread Nikita Danilov
Ingo Molnar writes:
 > 
 > * Nikita Danilov <[EMAIL PROTECTED]> wrote:
 > 
 > > Indeed, this technique is very well known. E.g., 
 > > http://citeseer.ist.psu.edu/anderson01sharedmemory.html has a whole 
 > > section (3. Local-spin Algorithms) on them, citing papers from the 
 > > 1990 onward.
 > 
 > that is a cool reference! So i'd suggest to do (redo?) the patch based 
 > on those concepts and that terminology and not use 'queued spinlocks' 

There is an old version:

http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p06-locallock.patch
http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p07-locallock-bkl.patch
http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p08-locallock-zone.patch

http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p0b-atomic_dec_and_locallock.patch
http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p0c-locallock-dcache.patch

This version retains the original spin-lock interface (i.e., no
additional "queue link" pointer is passed to the locking function). As a
result, the lock data structure contains an array of NR_CPUS counters,
so it is only suitable for global, statically allocated locks.
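
For reference, the underlying idea -- each CPU spins only on its own
cache line -- is the classic array-based ("Anderson") lock, sketched
here in userspace C11; this illustrates the technique only, it is not
code from the patches above:

#include <stdatomic.h>

#define NR_SLOTS 64	/* must be >= the number of contending threads */

struct array_lock {
	atomic_uint next_ticket;
	/* one flag per slot, padded so each sits in its own cache line */
	struct { atomic_uint go; char pad[60]; } slot[NR_SLOTS];
};
/* initialise with slot[0].go = 1 and everything else zero */

static unsigned array_lock_acquire(struct array_lock *l)
{
	unsigned t = atomic_fetch_add(&l->next_ticket, 1) % NR_SLOTS;

	while (!atomic_load(&l->slot[t].go))
		;				/* local spinning only */
	atomic_store(&l->slot[t].go, 0);
	return t;				/* slot is needed for release */
}

static void array_lock_release(struct array_lock *l, unsigned t)
{
	atomic_store(&l->slot[(t + 1) % NR_SLOTS].go, 1);
}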

 > that are commonly associated with MS's stuff. And as a result the 
 > contended case would be optimized some more via local-spin algorithms. 
 > (which is not a key thing for us, but which would be nice to have 
 > nevertheless)
 > 
 >  Ingo

Nikita.



Re: [rfc][patch] queued spinlocks (i386)

2007-03-24 Thread Nikita Danilov
Nick Piggin writes:
 > On Fri, Mar 23, 2007 at 11:04:18AM +0100, Ingo Molnar wrote:
 > > 
 > > * Nick Piggin <[EMAIL PROTECTED]> wrote:
 > > 
 > > > Implement queued spinlocks for i386. [...]
 > > 
 > > isnt this patented by MS? (which might not worry you SuSE/Novell guys, 
 > > but it might be a worry for the rest of the world ;-)
 > 
 > Hmm, it looks like they have implemented a system where the spinning
 > cpu sleeps on a per-CPU variable rather than the lock itself, and
 > the releasing cpu writes to that variable to wake it.  They do this
 > so that spinners don't continually perform exclusive->shared
 > transitions on the lock cacheline. They call these things queued
 > spinlocks.  They don't seem to be very patent worthy either, but

Indeed, this technique is very well known. E.g.,
http://citeseer.ist.psu.edu/anderson01sharedmemory.html has a whole
section (3. Local-spin Algorithms) on them, citing papers from 1990
onward.

Nikita.



Re: [RFC][PATCH] split file and anonymous page queues #3

2007-03-21 Thread Nikita Danilov
Rik van Riel writes:
 > Nikita Danilov wrote:
 > 
 > > Generally speaking, multi-queue replacement mechanisms were tried in the
 > > past, and they all suffer from the common drawback: once scanning rate
 > > is different for different queues, so is the notion of "hotness",
 > > measured by scanner. As a result multi-queue scanner fails to capture
 > > working set properly.
 > 
 > You realize that the current "single" queue in the 2.6 kernel
 > has this problem in a much worse way: when swappiness is low
 > and the kernel does not want to reclaim mapped pages, it will
 > randomly rotate those pages around the list.

Agreed. Some time ago I tried to solve this very problem with the
dont-rotate-active-list patch
(http://linuxhacker.ru/~nikita/patches/2.6.12-rc6/2005.06.11/vm_03-dont-rotate-active-list.patch),
but it had problems of its own.

 > 
 > In addition, the referenced bit on unmapped page cache pages
 > was ignored completely, making it impossible for the VM to
 > separate the page cache working set from transient pages due
 > to streaming IO.

Yes, basically FIFO for clean file system pages and FIFO-second-chance
for dirty file pages. Very bad.

 > 
 > I agree that we should put some more negative feedback in
 > place if it turns out we need it.  I have refault code ready
 > that can be plugged into this patch, but I don't want to add
 > the overhead of such code if it turns out we do not actually
 > need it.

In my humble opinion the VM already has too many mechanisms that are
supposed to help in corner cases, but there is little to be done about
that, except for a major rewrite.

Nikita.



Re: [RFC][PATCH] split file and anonymous page queues #3

2007-03-21 Thread Nikita Danilov
Rik van Riel writes:
 > Rik van Riel wrote:
 > > Nikita Danilov wrote:
 > > 
 > >> Probably I am missing something, but I don't see how that can help. For
 > >> example, suppose (for simplicity) that we have swappiness of 100%, and
 > >> that fraction of referenced anon pages gets slightly less than of file
 > >> pages. get_scan_ratio() increases anon_percent, and shrink_zone() starts
 > >> scanning anon queue more aggressively. As a result, pages spend less
 > >> time there, and have less chance of ever being accessed, reducing
 > >> fraction of referenced anon pages further, and triggering further
 > >> increase in the amount of scanning, etc. Doesn't this introduce positive
 > >> feed-back loop?
 > > 
 > > It's a possibility, but I don't think it will be much of an
 > > issue in practice.
 > > 
 > > If it is, we can always use refaults as a correcting
 > > mechanism - which would have the added benefit of being
 > > able to do streaming IO without putting any pressure on
 > > the active list, essentially clock-pro replacement with
 > > just some tweaks to shrink_list()...
 > 
 > As an aside, due to the use-once algorithm file pages are at a
 > natural disadvantage already.  I believe it would be really
 > hard to construct a workload where anon pages suffer the positive
 > feedback loop you describe...

That scenario works for file queues too. Of course, all this is but
theoretical speculation at this point, but I am concerned that

 - such a loop would tend to happen under various borderline conditions,
 making it hard to isolate, diagnose, and debug, and

 - long before it becomes explicitly visible (say, as excessive cpu
 consumption by the scanner), it would ruin global LRU ordering,
 degrading overall performance.

Generally speaking, multi-queue replacement mechanisms were tried in the
past, and they all suffer from a common drawback: once the scanning rate
is different for different queues, so is the notion of "hotness"
measured by the scanner. As a result a multi-queue scanner fails to
capture the working set properly.

Nikita.


 > 
 > -- 
 > Politics is the struggle between those who want to make their country
 > the best in the world, and those who believe it already is.  Each group
 > calls the other unpatriotic.



Re: [RFC][PATCH] split file and anonymous page queues #3

2007-03-21 Thread Nikita Danilov
Rik van Riel writes:
 > Nikita Danilov wrote:
 > > Rik van Riel writes:
 > >  > [ OK, I suck.  I edited yesterday's email with the new info, but forgot
 > >  >to change the attachment to today's patch.  Here is today's patch. ]
 > >  > 
 > >  > Split the anonymous and file backed pages out onto their own pageout
 > >  > queues.  This we do not unnecessarily churn through lots of anonymous
 > >  > pages when we do not want to swap them out anyway.
 > > 
 > > Won't this re-introduce problems similar to ones due to split
 > > inactive_clean/inactive_dirty queues we had in the past?
 > > 
 > > For example, by rotating anon queues faster than file queues, kernel
 > > would end up reclaiming anon pages that are hotter (in "absolute" LRU
 > > order) than some file pages.
 > 
 > That is why we check the fraction of referenced pages in each
 > queue.  Please look at the get_scan_ratio() and shrink_zone()
 > code in my patch.

Probably I am missing something, but I don't see how that can help. For
example, suppose (for simplicity) that we have a swappiness of 100%, and
that the fraction of referenced anon pages gets slightly lower than that
of file pages. get_scan_ratio() increases anon_percent, and
shrink_zone() starts scanning the anon queue more aggressively. As a
result, pages spend less time there, and have less chance of ever being
accessed, reducing the fraction of referenced anon pages further, and
triggering a further increase in the amount of scanning, etc. Doesn't
this introduce a positive feedback loop?

Nikita.



Re: [RFC][PATCH] split file and anonymous page queues #3

2007-03-21 Thread Nikita Danilov
Rik van Riel writes:
 > [ OK, I suck.  I edited yesterday's email with the new info, but forgot
 >to change the attachment to today's patch.  Here is today's patch. ]
 > 
 > Split the anonymous and file backed pages out onto their own pageout
 > queues.  This we do not unnecessarily churn through lots of anonymous
 > pages when we do not want to swap them out anyway.

Won't this re-introduce problems similar to the ones due to the split
inactive_clean/inactive_dirty queues we had in the past?

For example, by rotating anon queues faster than file queues, the kernel
would end up reclaiming anon pages that are hotter (in "absolute" LRU
order) than some file pages.

Nikita.



Re: [RFC][PATCH 0/3] VM throttling: avoid blocking occasional writers

2007-02-24 Thread Nikita Danilov
Tomoki Sekiyama writes:
 > Hi,

Hello,

 > 

[...]

 > 
 > While Dirty+Writeback pages get more than 40% of memory, process-B is
 > blocked in balance_dirty_pages() until writeback of some (`write_chunk',
 > typically = 1536) dirty pages on disk-b is started.

Maybe the simpler solution is to use separate variables to control the
ratelimit and the write chunk?

writeback_set_ratelimit() adjusts ratelimit_pages to avoid too-frequent
calls to balance_dirty_pages(), but once we are inside
writeback_inodes(), there is no need to write especially many pages in
one go: the overhead of any additional looping is negligible when
compared with the cost of writing.

Speaking of which, now that the expensive get_writeback_state() is gone
from page-writeback.c, why do we need adjustable ratelimiting at all? It
looks like writeback_set_ratelimit() can be dropped and a fixed
ratelimit used instead.
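
In other words, something like the following (names and values are made
up, only to illustrate the separation):

/* Two independent knobs instead of deriving both from ratelimit_pages. */
enum {
	DIRTY_RATELIMIT_PAGES = 32,	/* pages a task may dirty before
					 * re-entering balance_dirty_pages() */
	WRITEBACK_CHUNK_PAGES = 1024,	/* pages one pass of
					 * writeback_inodes() aims to write  */
};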

Nikita.



Re: [RFC 7/8] Exclude unreclaimable pages from dirty ration calculation

2007-01-18 Thread Nikita Danilov
Christoph Lameter writes:
 > Consider unreclaimable pages during dirty limit calculation
 > 
 > Tracking unreclaimable pages helps us to calculate the dirty ratio
 > the right way. If a large number of unreclaimable pages are allocated
 > (through the slab or through huge pages) then write throttling will
 > no longer work since the limit cannot be reached anymore.
 > 
 > So we simply subtract the number of unreclaimable pages from the pages
 > considered for writeout threshold calculation.
 > 
 > Other code that allocates significant amounts of memory for device
 > drivers etc could also be modified to take advantage of this functionality.

I think that a simpler solution to this problem is to use only
potentially reclaimable pages (that is, active, inactive, and free
pages) to calculate the writeout threshold. This way there is no need to
maintain counters for unreclaimable pages. Below is a patch implementing
this idea; it got some testing.

Nikita.

Fix write throttling to calculate its thresholds from the amount of memory
that can be consumed by file system and swap caches, rather than from the
total amount of physical memory. This avoids, among other things, situations
when memory consumed by the kernel slab allocator prevents write throttling
from ever happening.

Signed-off-by: Nikita Danilov <[EMAIL PROTECTED]>

 mm/page-writeback.c |   33 -
 1 files changed, 24 insertions(+), 9 deletions(-)

Index: git-linux/mm/page-writeback.c
===================================================================
--- git-linux.orig/mm/page-writeback.c
+++ git-linux/mm/page-writeback.c
@@ -101,6 +101,18 @@ EXPORT_SYMBOL(laptop_mode);
 
 static void background_writeout(unsigned long _min_pages);
 
+/* Maximal number of pages that can be consumed by pageable caches. */
+static unsigned long total_pageable_pages(void)
+{
+   unsigned long active;
+   unsigned long inactive;
+   unsigned long free;
+
+   get_zone_counts(&active, &inactive, &free);
+   /* +1 to never return 0. */
+   return active + inactive + free + 1;
+}
+
 /*
  * Work out the current dirty-memory clamping and background writeout
  * thresholds.
@@ -127,22 +139,31 @@ get_dirty_limits(long *pbackground, long
int unmapped_ratio;
long background;
long dirty;
-   unsigned long available_memory = vm_total_pages;
+   unsigned long total_pages;
+   unsigned long available_memory;
struct task_struct *tsk;
 
+   available_memory = total_pages = total_pageable_pages();
+
 #ifdef CONFIG_HIGHMEM
/*
 * If this mapping can only allocate from low memory,
 * we exclude high memory from our count.
 */
-   if (mapping && !(mapping_gfp_mask(mapping) & __GFP_HIGHMEM))
+   if (mapping && !(mapping_gfp_mask(mapping) & __GFP_HIGHMEM)) {
+   if (available_memory > totalhigh_pages)
available_memory -= totalhigh_pages;
+   else
+   available_memory = 1;
+   }
 #endif
 
 
unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) +
global_page_state(NR_ANON_PAGES)) * 100) /
-   vm_total_pages;
+   total_pages;
+   if (unmapped_ratio < 0)
+   unmapped_ratio = 0;
 
dirty_ratio = vm_dirty_ratio;
if (dirty_ratio > unmapped_ratio / 2)
@@ -513,7 +534,7 @@ void laptop_sync_completion(void)
 
 void writeback_set_ratelimit(void)
 {
-   ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
+   ratelimit_pages = total_pageable_pages() / (num_online_cpus() * 32);
if (ratelimit_pages < 16)
ratelimit_pages = 16;
if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
@@ -542,7 +563,7 @@ void __init page_writeback_init(void)
long buffer_pages = nr_free_buffer_pages();
long correction;
 
-   correction = (100 * 4 * buffer_pages) / vm_total_pages;
+   correction = (100 * 4 * buffer_pages) / total_pageable_pages();
 
if (correction < 100) {
dirty_background_ratio *= correction;



Re: Finding hardlinks

2007-01-04 Thread Nikita Danilov
Mikulas Patocka writes:
 > > > BTW. How does ReiserFS find that a given inode number (or object ID in
 > > > ReiserFS terminology) is free before assigning it to new file/directory?
 > >
 > > reiserfs v3 has an extent map of free object identifiers in
 > > super-block.
 > 
 > Inode free space can have at most 2^31 extents --- if inode numbers 
 > alternate between "allocated", "free". How do you pack it to superblock?

In the worst case, when free/used extents are small, some free oids are
"leaked", but this has never been a problem in practice. In fact, there
was a patch for reiserfs v3 to store this map in a special hidden file,
but it wasn't included in mainline, as nobody ever complained about oid
map fragmentation.

 > 
 > > reiser4 used 64 bit object identifiers without reuse.
 > 
 > So you are going to hit the same problem as I did with SpadFS --- you 
 > can't export 64-bit inode number to userspace (programs without 
 > -D_FILE_OFFSET_BITS=64 will have stat() randomly failing with EOVERFLOW 
 > then) and if you export only 32-bit number, it will eventually wrap-around 
 > and colliding st_ino will cause data corruption with many userspace 
 > programs.

Indeed, this is a fundamental problem. Reiser4 tries to ameliorate it by
using a hash function that starts colliding only when there are billions
of files, in which case a 32-bit inode number is screwed anyway.

Note that none of the above problems invalidates the reasons for having
long in-kernel inode identifiers that I outlined in another message.

 > 
 > Mikulas

Nikita.



Re: Finding hardlinks

2007-01-01 Thread Nikita Danilov
Mikulas Patocka writes:

[...]

 > 
 > BTW. How does ReiserFS find that a given inode number (or object ID in 
 > ReiserFS terminology) is free before assigning it to new file/directory?

reiserfs v3 has an extent map of free object identifiers in the
super-block. reiser4 used 64-bit object identifiers without reuse.
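
Purely as an illustration (this is not the actual reiserfs v3 on-disk
layout), such a map can be thought of as an array of (start, length)
runs of free identifiers:

/* Invented structure and helper, for illustration only. */
struct oid_extent {
	__u64 start;	/* first free object id in the run      */
	__u64 len;	/* number of consecutive free ids in it */
};

static __u64 oid_alloc(struct oid_extent *map)
{
	__u64 oid = map[0].start;

	map[0].start++;
	map[0].len--;	/* a run shrunk to zero length would be
			 * dropped from the map (not shown)     */
	return oid;
}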

 > 
 > Mikulas

Nikita.



Re: Finding hardlinks

2006-12-31 Thread Nikita Danilov
Mikulas Patocka writes:
 > 
 > 
 > On Fri, 29 Dec 2006, Trond Myklebust wrote:
 > 
 > > On Thu, 2006-12-28 at 19:14 +0100, Mikulas Patocka wrote:
 > >> Why don't you rip off the support for colliding inode number from the
 > >> kernel at all (i.e. remove iget5_locked)?
 > >>
 > >> It's reasonable to have either no support for colliding ino_t or full
 > >> support for that (including syscalls that userspace can use to work with
 > >> such filesystem) --- but I don't see any point in having half-way support
 > >> in kernel as is right now.
 > >
 > > What would ino_t have to do with inode numbers? It is only used as a
 > > hash table lookup. The inode number is set in the ->getattr() callback.
 > 
 > The question is: why does the kernel contain iget5 function that looks up 
 > according to callback, if the filesystem cannot have more than 64-bit 
 > inode identifier?

Generally speaking, a file system might have two different identifiers
for files:

 - one that makes it easy to tell whether two files are the same one;

 - one that makes it easy to locate a file on the storage.

According to POSIX, the inode number should always work as an identifier
of the first class, but not necessarily as one of the second. For
example, in reiserfs something called "a key" is used to locate the
on-disk inode, which, in turn, contains the inode number. Identifiers of
the second class tend to live in directory entries, and during lookup we
want to consult the inode cache _before_ reading the inode from the disk
(otherwise the cache is mostly useless), right? This means that some
file systems want to index inodes in a cache by something other than the
inode number.
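
This is what iget5_locked() makes possible: the hash value and the
comparison callback can be derived from an opaque key instead of the
inode number. A rough sketch follows; the "myfs" names, MYFS_I() and
hash_key() are invented here:

struct myfs_key {
	u64 locality;
	u64 objectid;
};

static int myfs_inode_test(struct inode *inode, void *data)
{
	return memcmp(&MYFS_I(inode)->key, data,
		      sizeof(struct myfs_key)) == 0;
}

static int myfs_inode_set(struct inode *inode, void *data)
{
	MYFS_I(inode)->key = *(struct myfs_key *)data;
	return 0;
}

static struct inode *myfs_iget(struct super_block *sb, struct myfs_key *key)
{
	struct inode *inode;

	/* look the inode up in the cache by key, not by inode number */
	inode = iget5_locked(sb, hash_key(key), myfs_inode_test,
			     myfs_inode_set, key);
	if (inode != NULL && (inode->i_state & I_NEW)) {
		/* ... read the on-disk inode located by @key ... */
		unlock_new_inode(inode);
	}
	return inode;
}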

There is another reason why I, personally, would like to have the
ability to index inodes by things other than inode numbers: delayed
inode number allocation. Strictly speaking, a file system has to assign
an inode number to a file only when it is just about to report it to
user space (either through stat, or, ugh... readdir). If the location of
an inode on disk depends on its inode number (as it does in inode-table
based file systems like ext[23]), then delayed inode number allocation
has the same advantages as delayed block allocation.

 > 
 > This lookup callback just induces writing bad filesystems with coliding 
 > inode numbers. Either remove coda, smb (and possibly other) filesystems 
 > from the kernel or make a proper support for userspace for them.
 > 
 > The situation is that current coreutils 6.7 fail to recursively copy 
 > directories if some two directories in the tree have coliding inode 
 > number, so you get random data corruption with these filesystems.
 > 
 > Mikulas

Nikita.



Re: sched_yield() makes OpenLDAP slow

2005-08-21 Thread Nikita Danilov
Howard Chu writes:
 > Lee Revell wrote:
 > >  On Sat, 2005-08-20 at 11:38 -0700, Howard Chu wrote:
 > > > But I also found that I needed to add a new yield(), to work around
 > > > yet another unexpected issue on this system - we have a number of
 > > > threads waiting on a condition variable, and the thread holding the
 > > > mutex signals the var, unlocks the mutex, and then immediately
 > > > relocks it. The expectation here is that upon unlocking the mutex,
 > > > the calling thread would block while some waiting thread (that just
 > > > got signaled) would get to run. In fact what happened is that the
 > > > calling thread unlocked and relocked the mutex without allowing any
 > > > of the waiting threads to run. In this case the only solution was
 > > > to insert a yield() after the mutex_unlock().
 > >
 > >  That's exactly the behavior I would expect.  Why would you expect
 > >  unlocking a mutex to cause a reschedule, if the calling thread still
 > >  has timeslice left?
 >
 > That's beside the point. Folks are making an assertion that
 > sched_yield() is meaningless; this example demonstrates that there are
 > cases where sched_yield() is essential.

It is not essential, it is non-portable.

The code you described is based on non-portable "expectations" about
thread scheduling. The Linux implementation of pthreads fails to satisfy
them. Perfectly reasonable. The code is then "fixed" by adding
sched_yield() calls and introducing more non-portable assumptions.
Again, there is no guarantee this would work on any compliant
implementation.

While the "intuitive" semantics of sched_yield() is to yield the CPU and
to give other runnable threads their chance to run, this is _not_ what
the standard prescribes (for non-RT threads).

 >
 > --
 >   -- Howard Chu

Nikita.


Re: sched_yield() makes OpenLDAP slow

2005-08-20 Thread Nikita Danilov
Howard Chu writes:
 > Nikita Danilov wrote:
 > > That returns us to the core of the problem: sched_yield() is used to
 > > implement a synchronization primitive and non-portable assumptions are
 > > made about its behavior: SUS defines that after sched_yield() thread
 > > ceases to run on the CPU "until it again becomes the head of its thread
 > > list", and "thread list" discipline is only defined for real-time
 > > scheduling policies. E.g., 
 > >
 > > int sched_yield(void)
 > > {
 > >return 0;
 > > }
 > >
 > > and
 > >
 > > int sched_yield(void)
 > > {
 > >sleep(100);
 > >return 0;
 > > }
 > >
 > > are both valid sched_yield() implementation for non-rt (SCHED_OTHER)
 > > threads.
 > I think you're mistaken:
 > http://groups.google.com/group/comp.programming.threads/browse_frm/thread/0d4eaf3703131e86/da051ebe58976b00#da051ebe58976b00
 > 
 > sched_yield() is required to be supported even if priority scheduling is 
 > not supported, and it is required to cause the calling thread (not 
 > process) to yield the processor.

Of course sched_yield() is required to be supported; the question is for
how long the CPU is yielded. Here is the quote from SUS (actually the
complete definition of sched_yield()):

The sched_yield() function shall force the running thread to
relinquish the processor until it again becomes the head of its
thread list.

As far as I can see, SUS doesn't specify how the "thread list" is
maintained for a non-RT scheduling policy, and an implementation that
immediately places a SCHED_OTHER thread that called sched_yield() back
at the head of its thread list is perfectly valid. Also valid is an
implementation that waits for 100 seconds and then places the
sched_yield() caller at the head of the list, etc. Basically, while the
semantics of sched_yield() are well defined for the RT scheduling
policy, for the SCHED_OTHER policy the standard leaves it
implementation-defined.

 > 
 > -- 
 >   -- Howard Chu

Nikita.


Re: sched_yield() makes OpenLDAP slow

2005-08-20 Thread Nikita Danilov
Howard Chu writes:
 > Nikita Danilov wrote:

[...]

 > 
 > >  What prevents transaction monitor from using, say, condition
 > >  variables to "yield cpu"? That would have an additional advantage of
 > >  blocking thread precisely until specific event occurs, instead of
 > >  blocking for some vague indeterminate load and platform dependent
 > >  amount of time.
 > 
 > Condition variables offer no control over which thread is waken up. 

When only one thread waits on a condition variable, which is exactly the
scenario involved --sorry if I wasn't clear enough-- a condition signal
provides precise control over which thread is woken up.
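
A minimal sketch of that pattern (plain pthreads, nothing
BerkeleyDB-specific; it assumes a single waiting thread, as in the case
above):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool event_occurred;

/* the waiter blocks precisely until the event is posted */
void wait_for_event(void)
{
	pthread_mutex_lock(&lock);
	while (!event_occurred)
		pthread_cond_wait(&cond, &lock);
	event_occurred = false;
	pthread_mutex_unlock(&lock);
}

/* the poster wakes the single waiting thread; no sched_yield() involved */
void post_event(void)
{
	pthread_mutex_lock(&lock);
	event_occurred = true;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
}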

 > We're wandering into the design of the SleepyCat BerkeleyDB library 
 > here, and we don't exert any control over that either. BerkeleyDB 
 > doesn't appear to use pthread condition variables; it seems to construct 
 > its own synchronization mechanisms on top of mutexes (and yield calls). 

That returns us to the core of the problem: sched_yield() is used to
implement a synchronization primitive, and non-portable assumptions are
made about its behavior. SUS defines that after sched_yield() the thread
ceases to run on the CPU "until it again becomes the head of its thread
list", and the "thread list" discipline is only defined for real-time
scheduling policies. E.g.,

int sched_yield(void)
{
   return 0;
}

and

int sched_yield(void)
{
   sleep(100);
   return 0;
}

are both valid sched_yield() implementation for non-rt (SCHED_OTHER)
threads.

Nikita.


Re: sched_yield() makes OpenLDAP slow

2005-08-19 Thread Nikita Danilov
Howard Chu <[EMAIL PROTECTED]> writes:

[...]

> concurrency. It is the nature of such a system to encounter deadlocks
> over the normal course of operations. When a deadlock is detected, some
> thread must be chosen (by one of a variety of algorithms) to abort its
> transaction, in order to allow other operations to proceed to
> completion. In this situation, the chosen thread must get control of the
> CPU long enough to clean itself up,

What prevents the transaction monitor from using, say, condition
variables to "yield the cpu"? That would have the additional advantage
of blocking the thread precisely until a specific event occurs, instead
of blocking for some vague, indeterminate, load- and platform-dependent
amount of time.

> and then it must yield the CPU in
> order to allow any other competing threads to complete their
> transaction.

Again, this sounds like something doable with standard POSIX
synchronization primitives.

>
> -- 
>   -- Howard Chu

Nikita.


Re: [patch] fix race in __block_prepare_write (again)

2005-04-22 Thread Nikita Danilov
Anton Altaparmakov writes:

[...]

 > 
 > mm/filemap.c::file_buffered_write():
 > 
 > - It calls fault_in_pages_readable() which is completely bogus if
 > @nr_segs > 1.  It needs to be replaced by a to be written
 > "fault_in_pages_readable_iovec()".

Which will be only marginally less bogus, because the page(s) can be
evicted from memory between fault_in_pages_readable*() and
__grab_cache_page() anyway.

[...]

 > Best regards,
 > 
 > Anton

Nikita.


Re: Info Regarding MCR tool

2005-04-06 Thread Nikita Danilov
karthik <[EMAIL PROTECTED]> writes:

> Hi,
>
>   If anybody is having any idea of what is MCR and what is its use, 
> just tell me. i think its some Monitor related software. But i want in more 
> detail of what is it and how is it working.

MCR stands for "Monitor Console Routine". Press Ctrl-C to get to the
"MCR>" prompt. As to "how is it working", look at the source:

http://www.bitsavers.org/bits/DEC/pdp15/dectapeImages/XVM_RSX/_textfiles/DEC-XV-IXRAA-A-UA5_02-28-77/

5 days or 28 years, I hope it is not too late.

>
> Karthik

Nikita.


Re: [PATCH] mm counter operations through macros

2005-03-12 Thread Nikita Danilov
Christoph Lameter writes:
 > On Fri, 11 Mar 2005, Dave Jones wrote:
 > 
 > > Splitting this last one into inc_mm_counter() and dec_mm_counter()
 > > means you can kill off the last argument, and get some of the
 > > readability back. As it stands, I think this patch adds a bunch
 > > of obfuscation for no clear benefit.
 > 
 > Ok.
 > -
 > This patch extracts all the operations on counters protected by the
 > page table lock (currently rss and anon_rss) into definitions in
 > include/linux/sched.h. All rss operations are performed through
 > the following macros:
 > 
 > get_mm_counter(mm, member)   -> Obtain the value of a counter
 > set_mm_counter(mm, member, value)-> Set the value of a counter
 > update_mm_counter(mm, member, value) -> Add to a counter

A nitpick, but wouldn't it be clearer to call it add_mm_counter()? As an
additional bonus, this matches atomic_{inc,dec,add}() and makes the
macro names more uniform.

 > inc_mm_counter(mm, member)   -> Increment a counter
 > dec_mm_counter(mm, member)   -> Decrement a counter
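
With that renaming the set above would read, e.g. (the expansions below
are illustrative, not copied from the patch):

#define get_mm_counter(mm, member)		((mm)->_##member)
#define set_mm_counter(mm, member, value)	((mm)->_##member = (value))
#define add_mm_counter(mm, member, value)	((mm)->_##member += (value))
#define inc_mm_counter(mm, member)		((mm)->_##member++)
#define dec_mm_counter(mm, member)		((mm)->_##member--)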

Nikita.


Re: [2.6.11-rc5-mm1 patch] reiser4 Kconfig help cleanup

2005-03-02 Thread Nikita Danilov
Andrew Morton <[EMAIL PROTECTED]> writes:

> Jes Sorensen <[EMAIL PROTECTED]> wrote:
>>

[...]

>> 
>> [EMAIL PROTECTED] linux-2.6.11-rc5-mm1]$ grep PG_arch fs/reiser4/*.c
>> fs/reiser4/page_cache.c:   page_flag_name(page, PG_arch_1),
>> fs/reiser4/txnmgr.c:assert("vs-1448", 
>> test_and_clear_bit(PG_arch_1, &node->pg->flags));
>> fs/reiser4/txnmgr.c:ON_DEBUG(set_bit(PG_arch_1, 
>> &(copy->pg)->flags));
>> 
>> Someone was obviously smoking something illegal, what part of 'arch'
>> did she/he not understand? I assume we can request this is fixed by
>> the patch owner asap.
>> 
>
> Could the reiserfs team please comment?
>
> If it's just debug then probably it would be better to add a new flag.
>

Yes, this is debugging code. I believe it can be removed now.

> If these pages are never mmapped then it'll just happen to work, I guess. 
> But a filesystem really shouldn't be dinking with PG_arch_1.

Nikita.


patch to fs/proc/base.c

2001-07-20 Thread Nikita Danilov

Hello, 

The following patch cures oopses in 2.4.7-pre9 when
proc_pid_make_inode() is called on a task with task->mm == NULL.

Linus, please apply, if you haven't already got a bunch of equivalent
patches, which is doubtful.

Nikita.

--- linux-2.4.7-pre9/fs/proc/base.cFri Jul 20 14:57:55 2001
+++ linux-2.4.7-pre9.patched/fs/proc/base.c Fri Jul 20 17:03:23 2001
@@ -670,7 +670,7 @@ static struct inode *proc_pid_make_inode
inode->u.proc_i.task = task;
inode->i_uid = 0;
inode->i_gid = 0;
-   if (ino == PROC_PID_INO || task->mm->dumpable) {
+   if (ino == PROC_PID_INO || (task->mm && task->mm->dumpable)) {
inode->i_uid = task->euid;
inode->i_gid = task->egid;
}
