Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

2016-11-22 Thread Simon Kirby
On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote:

> On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> > On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
> >> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
> >>>> 4.9rc5 however seems to be doing better, and is still running after 18
> >>>> hours. However, I got a few page allocation failures as per below, but
> >>>> the system seems to recover.
> >>>> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5
> >>>> days) or is that good enough, and I should go back to 4.8.8 with that
> >>>> patch applied?
> >>>> https://marc.info/?l=linux-mm&m=147423605024993
> >>>
> >>> Hi, I think it's enough for 4.9 for now and I would appreciate trying
> >>> 4.8 with that patch, yeah.
> >>
> >> So the good news is that it's been running for almost 5H and so far so 
> >> good.
> > 
> > And the better news is that the copy is still going strong, 4.4TB and
> > going. So 4.8.8 is fixed with that one single patch as far as I'm
> > concerned.
> > 
> > So thanks for that, looks good to me to merge.
> 
> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> already EOL AFAICS).
> 
> - send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
>   - alternatively a simpler (again, 4.8-only) patch that just outright
>     prevents OOM for 0 < order < costly, as Michal already suggested.
> - backport 10+ compaction patches to 4.8 stable
> - something else?
> 
> Michal? Linus?
> 
> [1] https://marc.info/?l=linux-mm&m=147423605024993

Sorry for my molasses rate of feedback. I found a workaround, setting
vm/watermark_scale_factor to 500, and threw that in sysctl. This was on
the MythTV box that OOMs everything after about a day on 4.8 otherwise.
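
For context on what that setting changes: the kernel's vm sysctl documentation describes watermark_scale_factor in fractions of 10,000, with a default of 10 (0.1% of memory between watermarks). The sketch below only illustrates that arithmetic. The helper name and the 16 GiB example are made up for illustration, and the real kernel applies the factor per zone and also takes min_free_kbytes into account.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): distance between watermarks,
 * in pages, for a given amount of managed memory, using the documented
 * "fractions of 10,000" unit of vm.watermark_scale_factor.
 */
static uint64_t watermark_gap_pages(uint64_t managed_pages,
				    unsigned int scale_factor)
{
	return managed_pages * scale_factor / 10000;
}
```

Assuming 4 KiB pages, 16 GiB is 4194304 pages, so the default factor of 10 gives a gap of 4194 pages (about 16 MiB), while 500 gives 209715 pages (about 819 MiB). That much extra headroom for kswapd is one plausible reason the workaround hides the problem.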

I've been running [1] for 9 days on it (4.8.4 + [1]) without issue, but
just realized I forgot to remove the watermark_scale_factor workaround.
I've now restored the default value, so I'll see if it becomes unhappy by
tomorrow.

I also threw up a few other things you had asked for (vmstat, zoneinfo
before and after the first OOM on 4.8.4): http://0x.ca/sim/ref/4.8.4/
(that was before booting into a rebuild with [1] applied)

Simon-


Re: Hung task detector versus NFS (TASK_KILLABLE)

2016-03-09 Thread Simon Kirby
On Mon, Mar 07, 2016 at 07:11:19PM -0800, Andi Kleen wrote:

> > I write this because I would actually find it useful to see the original
> > backtrace, even if it is interruptible, not just the collateral damage.
> > Since the "skipping" of NFS is basically incomplete anyway, how big a
> > deal is this "feature"?
> 
> Random backtrace spewing is always a misfeature for 99.99+% of the users
> for whom it is gibberish.

Distributions all seem to ship with it on because apparently some people
can read it. There was even discussion that the default 10 is not enough.

> If you really need it yourself add a kprobe.

To emulate a hung task backtrace even when TASK_KILLABLE? That sounds
like some hoop-jumping, but I don't know kprobes.

I'm just saying the current "NFS filter" is broken ("cat a" twice). But
since this really will make more noise for people (in cases where NFS is
stuck for minutes), I guess I'll just sit in a corner with that line
changed in my tree.

Simon-


Hung task detector versus NFS (TASK_KILLABLE)

2016-03-07 Thread Simon Kirby
Hello!

Back in 2008, you committed 316d9679f33caf7e683471647d1472bfe133d858
which changed softlockup.c (now moved to hung_task.c) to avoid logging a
spew of soft lockup warnings when the Ethernet cable is unplugged with
active NFS mounts.

Meanwhile, I've been seeing hung task warnings like this for years, so I
wondered what the deal is. It seems there are VFS paths that can enter
uninterruptible sleep as a result of locks held in interruptible sleep.

For example, I can reproduce hung task warnings by firewalling NFS, then
"cat a" twice: the second hangs in mutex_lock() from path_openat(), which
then spews a hung task warning.

I write this because I would actually find it useful to see the original
backtrace, even if it is interruptible, not just the collateral damage.
Since the "skipping" of NFS is basically incomplete anyway, how big a
deal is this "feature"?

Would anybody object if we just returned this to anything blocked?

The lines in question these days are here in kernel/hung_task.c:

		/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
		if (t->state == TASK_UNINTERRUPTIBLE)
			check_hung_task(t, timeout);

It used to be t->state & TASK_UNINTERRUPTIBLE.
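
To make the difference between the two tests concrete, here is a small user-space sketch. TASK_KILLABLE is defined in the kernel as TASK_WAKEKILL | TASK_UNINTERRUPTIBLE; the numeric values below are illustrative only (they have changed between kernel versions), and the two function names are hypothetical.

```c
/* Illustrative values; the real ones live in include/linux/sched.h. */
#define TASK_UNINTERRUPTIBLE	0x0002
#define TASK_WAKEKILL		0x0080
#define TASK_KILLABLE		(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)

/* Current test: only plain uninterruptible sleepers are checked. */
static int checked_with_eq(long state)
{
	return state == TASK_UNINTERRUPTIBLE;
}

/* Older test: anything with the uninterruptible bit set is checked. */
static int checked_with_and(long state)
{
	return (state & TASK_UNINTERRUPTIBLE) != 0;
}
```

A TASK_KILLABLE sleeper has both bits set, so it fails the "==" test and passes the "&" test. The task stuck behind it in mutex_lock() is plain TASK_UNINTERRUPTIBLE and is reported either way, which is the collateral damage described above.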

Simon-


Re: Dirty pages underflow on 3.14.23

2015-01-07 Thread Simon Kirby
On Wed, Jan 07, 2015 at 10:48:10PM +0100, Vlastimil Babka wrote:

> On 01/07/2015 10:28 PM, Simon Kirby wrote:
>
> > Hmm...A possibly-related issue...Before trying this, after a fresh boot,
> > /proc/vmstat showed:
> > 
> > nr_alloc_batch 4294541205
> 
> This can happen, and not be a problem in general. However, there was a fix
> abe5f972912d086c080be4bde67750630b6fb38b in 3.17 for a potential performance
> issue if this counter overflows on single processor configuration. It was
> marked stable, but the 3.16 series was discontinued before the fix could be
> backported. So if you are on single-core, you might hit the performance issue.

That particular commit seems to just change the code path in that case,
but should it be underflowing at all on UP?

> > Still, nr_alloc_batch reads as 4294254379 after MySQL restart, and now
> > seems to stay up there.
> 
> Hm if it stays there, then you are probably hitting the performance issue.
> Look at /proc/zoneinfo, which zone has the underflow. It means this zone will
> get unfair amount of allocations, while others may contain stale data and
> would be better candidates.

In this case, it has only 640MB, and there's only DMA and Normal. This is
affecting Normal, and DMA is so small that it probably doesn't matter.

Simon-
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Dirty pages underflow on 3.14.23

2015-01-07 Thread Simon Kirby
On Wed, Jan 07, 2015 at 10:57:46AM +, Holger Hoffstätte wrote:

> On Tue, 06 Jan 2015 12:54:43 -0500, Mikulas Patocka wrote:
> 
> > I can't reproduce it. It happened just once.
> > 
> > That patch is supposed to fix an occasional underflow by a single page -
> > while my meminfo showed underflow by 22952KiB (5738 pages).
> 
> You are probably looking for:
> commit 835f252c6debd204fcd607c79975089b1ecd3472
> "aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer"
> 
> It definitely went into 3.14.26, don't know about 3.16.x.

I can confirm that a MySQL shutdown/restart triggers it for me, even
immediately following a fresh boot:

# uname -a ; grep '^nr_dirty ' /proc/vmstat; /etc/init.d/mysql restart; \
 grep '^nr_dirty ' /proc/vmstat
Linux blue 3.16.6-blue #51 Mon Oct 20 14:00:47 PDT 2014 i686 GNU/Linux
nr_dirty 13
[ ok ] Stopping MySQL database server: mysqld.
[ ok ] Starting MySQL database server: mysqld . ..
[info] Checking for tables which need an upgrade, are corrupt or were not closed cleanly..
nr_dirty 4294967245

Hmm...A possibly-related issue...Before trying this, after a fresh boot,
/proc/vmstat showed:

nr_alloc_batch 4294541205

and after the restart, it shows:

nr_alloc_batch 161

...anyway, git cherry-pick ce4b66be6cd964e84363afd4a603633dd061b3b8 on a
3.16.6 tree does seem to stop nr_dirty from underflowing...Yay!

Still, nr_alloc_batch reads as 4294254379 after MySQL restart, and now
seems to stay up there.

Simon-


Re: Dirty pages underflow on 3.14.23

2015-01-07 Thread Simon Kirby
On Mon, Jan 05, 2015 at 06:05:59PM -0500, Mikulas Patocka wrote:

> Hi
> 
> I would like to report a memory management bug where the dirty pages count 
> underflowed.

Hello!

I've been hitting this problem for a while now. I've seen it on:

3.12.9
3.14.4
3.16
3.16.6

When it occurs, /proc/vmstat shows nr_dirty values such as:

nr_dirty 4294967031 (3.12.9)
nr_dirty 4294967251 (3.16.6)

No other counters appear to be negative or have wrapped in 32 bits, and
/proc/meminfo is similar to your report. See proc file copies and
.config here: http://0x.ca/sim/ref/3.16.6-blue/ (hosting box is this one)
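
The nr_dirty values above are what a small negative count looks like when a 32-bit counter is printed as unsigned (this box is 32-bit x86). A minimal sketch of the conversion, with a hypothetical helper name:

```c
#include <stdint.h>

/*
 * Interpret a 32-bit counter, as printed unsigned in /proc/vmstat, as a
 * signed value. Anything with the top bit set is treated as raw - 2^32.
 */
static int64_t vmstat_as_signed32(uint32_t raw)
{
	if (raw >= 0x80000000u)
		return (int64_t)raw - 4294967296LL;
	return (int64_t)raw;
}
```

Fed the two values above, this gives -265 pages (3.12.9) and -45 pages (3.16.6), so in both cases the counter was decremented a few hundred times too many at most.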

> It happened after some time that the Dirty pages count underflowed, as can 
> be seen in /proc/meminfo. The underflow condition was persistent, 
> /proc/meminfo was showing the big value even when the system was 
> completely idle. The counter never returned to zero.
> 
> The system didn't crash, but it became very slow - because of the big 
> value in the "Dirty" field, lazy writing was not working anymore, any 
> process that created a dirty page triggered immediate writeback, which 
> slowed down the system very much. The only fix was to reboot the machine.

This is also the case with me, although each time it occurs it seems to
be when I'm running apt-get upgrade to apply updates. Today, it occurred
on 3.16.6 as I started an "apt-get update". It is still possible to dirty
new pages and make some progress, but it becomes unusably slow. It ends
up writing the same blocks forever (from blktrace | grep D):

 33,00 2776 1.220890482 20335  D   W 43765671 + 8 [kworker/u2:0]
 33,00 2783 1.221073198 20335  D   W 7439223 + 8 [kworker/u2:0]
 33,00 2791 1.224824452 20335  D   W 43765671 + 8 [kworker/u2:0]
 33,00 2800 1.232559686 20335  D   W 7439223 + 8 [kworker/u2:0]

> The kernel version where this happened is 3.14.23. The kernel is compiled 
> without SMP and with preemption. The system is single-core 32-bit x86.

Same. The only other oddity to note is that the IDE driver is still
enabled in my case; root is on /dev/md6 which is a RAID 1 of hde1, hdg1.

> I see that 3.14.24 contains some fix for underflow (commit 
> 6619741f17f541113a02c30f22a9ca22e32c9546, upstream commit 
> abe5f972912d086c080be4bde67750630b6fb38b), but it doesn't seem that that 
> commit fixes this condition. If you have a commit that could fix this, say 
> it.

That doesn't seem to have made it to 3.16.6, but it sounds like a
fairness thing more than a race fix. Vlastimil pointed at this as
possibly useful:

http://ozlabs.org/~akpm/mmots/broken-out/mm-protect-set_page_dirty-from-ongoing-truncation.patch

...but I can't reproduce this immediately. So far, I have to
forget about it for a while, then do an apt-get upgrade.

> MemTotal: 253504 kB

MemTotal: 639396 kB

640MB should be enough for anybody. :)

Hmm, just tried to shut it down as cleanly as possible with sysrq-s,
sysrq-u, and got:

SysRq : Emergency Sync
Emergency Sync complete
SysRq : Emergency Remount R/O
------------[ cut here ]------------
WARNING: CPU: 0 PID: 24535 at fs/ext3/inode.c:1590 ext3_ordered_writepage+0x7c/0x240()
Modules linked in: xt_recent ts_kmp xt_string nfnetlink_log e100 xt_hashlimit xt_state xt_REDIRECT nf_conntrack_ftp iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack [last unloaded: xt_recent]
CPU: 0 PID: 24535 Comm: kworker/u2:0 Not tainted 3.16.6-blue #51
Hardware name: MICRO-STAR INTERNATIONAL CO., LTD MS-6330/MS-6330, BIOS 6.00 PG 06/15/2001
Workqueue: writeback bdi_writeback_workfn (flush-9:6)
   dd3f5c70 c16091f1 dd3f5ca0 c103231b c17bf9ac 
 5fd7 c17e8a3b 0636 c115db7c c115db7c e7c3d140  e4d673d0
 dd3f5cb0 c103235d 0009  dd3f5cd8 c115db7c dd3f5cc8 c10fe13c
Call Trace:
 [] dump_stack+0x16/0x18
 [] warn_slowpath_common+0x7b/0xa0
 [] ? ext3_ordered_writepage+0x7c/0x240
 [] ? ext3_ordered_writepage+0x7c/0x240
 [] warn_slowpath_null+0x1d/0x20
 [] ext3_ordered_writepage+0x7c/0x240
 [] ? __set_page_dirty_buffers+0xc/0x90
 [] __writepage+0xb/0x30
 [] ? mapping_tagged+0x10/0x10
 [] write_cache_pages+0x161/0x3a0
 [] ? blk_finish_plug+0xd/0x30
 [] ? mapping_tagged+0x10/0x10
 [] generic_writepages+0x2f/0x60
 [] do_writepages+0x35/0x40
 [] __writeback_single_inode+0x3b/0x1e0
 [] writeback_sb_inodes+0x160/0x2e0
 [] __writeback_inodes_wb+0x6c/0xa0
 [] wb_writeback+0x1a2/0x240
 [] bdi_writeback_workfn+0x149/0x370
 [] process_one_work+0xef/0x310
 [] worker_thread+0xe8/0x410
 [] ? mod_delayed_work_on+0x60/0x60
 [] ? mod_delayed_work_on+0x60/0x60
 [] kthread+0x95/0xb0
 [] ret_from_kernel_thread+0x20/0x30
 [] ? __kthread_parkme+0x60/0x60
---[ end trace ca1dc42be1a0b8e5 ]---
EXT4-fs (md7): re-mounted. Opts: (null)
EXT4-fs (md2): re-mounted. Opts: (null)
Emergency Remount complete
EXT4-fs (md2): ext4_writepages: jbd2_start: 1024 pages, ino 9438915; err -30
EXT4-fs (md2): ext4_writepages: jbd2_start: 1024 pages, ino 9438915; err -30

Re: net_ns cleanup / RCU overhead

2014-08-28 Thread Simon Kirby
On Thu, Aug 28, 2014 at 01:46:58PM -0700, Paul E. McKenney wrote:

> On Thu, Aug 28, 2014 at 03:33:42PM -0500, Eric W. Biederman wrote:
> 
> > I just want to add a little bit more analysis to this.
> > 
> > What we desire to be fast is the copy_net_ns, cleanup_net is batched and
> > asynchronous which nothing really cares how long it takes except that
> > cleanup_net holds the net_mutex and thus blocks copy_net_ns.
> > 
> > The puzzle is why and which rcu delays Simon is seeing in the network
> > namespace cleanup path, as it seems like the synchronize_rcu is not
> > the only one, and in the case of vsftp with trivial network namespaces
> > where nothing has been done we should not need to delay.
> 
> Indeed, given the version and .config, I can't see why any individual
> RCU grace-period operation would be particularly slow.
> 
> I suggest using ftrace on synchronize_rcu() and friends.

I made a parallel net namespace create/destroy benchmark that prints the
progress and time to create and cleanup 32 unshare()d child processes:

http://0x.ca/sim/ref/tools/netnsbench.c

I noticed that if I haven't run it for a while, the first batch often is
fast, followed by slowness from then on:

 0.039478s
-++-++-- 4.463837s
+--+++-- 3.011882s
+++---+- 2.283993s

Fiddling around on a stock kernel, "echo 1 > /sys/kernel/rcu_expedited"
makes behaviour change as it did with my patch:

++-++-+++-+-+-+-++-+-++--++-+--+-+-++--++-+-+-+-++-+--++ 0.801406s
+-+-+-++-+-+-+-+-++--+-+-++-+--++-+-+-+-+-+-+-+-+-+-+-+--++-+--- 0.872011s
++--+-++--+-++--+-++--+-+-+-+-++-+--++--+-++-+-+-+-+--++-+-+-+-- 0.946745s

How would I use ftrace on synchronize_rcu() here?

As Eric said, cleanup_net() is batched, but while it is cleaning up,
net_mutex is held. Isn't the issue just that net_mutex is held while
some other things are going on that are meant to be lazy / batched?

What is net_mutex protecting in cleanup_net()?

I noticed that [kworker/u16:0]'s stack is often:

[] wait_rcu_gp+0x46/0x50
[] synchronize_sched+0x2e/0x50
[] nf_nat_net_exit+0x2c/0x50 [nf_nat]
[] ops_exit_list.isra.4+0x39/0x60
[] cleanup_net+0xf0/0x1a0
[] process_one_work+0x157/0x440
[] worker_thread+0x63/0x520
[] kthread+0xd6/0xf0
[] ret_from_fork+0x7c/0xb0
[] 0x

and

[] _rcu_barrier+0x154/0x1f0
[] rcu_barrier+0x10/0x20
[] kmem_cache_destroy+0x6c/0xb0
[] nf_conntrack_cleanup_net_list+0x167/0x1c0 [nf_conntrack]
[] nf_conntrack_pernet_exit+0x65/0x70 [nf_conntrack]
[] ops_exit_list.isra.4+0x53/0x60
[] cleanup_net+0xf0/0x1a0
[] process_one_work+0x157/0x440
[] worker_thread+0x63/0x520
[] kthread+0xd6/0xf0
[] ret_from_fork+0x7c/0xb0
[] 0x

So I tried flushing iptables rules and rmmoding netfilter bits:

-++++--- 0.179940s
++--+-+- 0.151988s
---+--+++--- 0.159967s
++--++-- 0.175964s

Expedited:

++-+--++-+-+-+-+-+-+--++-+-+-++-+-+-+--++-+-+-+-+-+-+-+-+-+-+--- 0.079988s
++-+-+-+-+-+-+-+-+-+-+-+--++-+--++-+--+-++-+-+--++-+-+-+-+-+-+-- 0.089347s
--+++--++--+-+++-+--++-+-+--++-+-+--++-- 0.081566s
+-+++---++-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-+-+-+-+-+-+--- 0.089026s

So, much faster. It seems that just loading nf_conntrack_ipv4 (for example,
by running iptables -t nat -nvL) is enough to slow it way down. But it is
still capable of being fast, as above.

Simon-


Re: net_ns cleanup / RCU overhead

2014-08-28 Thread Simon Kirby
On Thu, Aug 28, 2014 at 12:24:31PM -0700, Paul E. McKenney wrote:

> On Tue, Aug 19, 2014 at 10:58:55PM -0700, Simon Kirby wrote:
> > Hello!
> > 
> > In trying to figure out what happened to a box running lots of vsftpd
> > since we deployed a CONFIG_NET_NS=y kernel to it, we found that the
> > (wall) time needed for cleanup_net() to complete, even on an idle box,
> > can be quite long:
> > 
> > #!/bin/bash
> > 
> > ip netns delete test >&/dev/null
> > while ip netns add test; do
> > 	echo hi
> > 	ip netns delete test
> > done
> > 
> > On my desktop and typical hosts, this prints at only around 4 or 6 per
> > second. While this is happening, "vmstat 1" reports 100% idle, and
> > there are D-state processes with stacks similar to:
> > 
> > 30566 [kworker/u16:1] D wait_rcu_gp+0x48, synchronize_sched+0x2f, 
> > cleanup_net+0xdb, process_one_work+0x175, worker_thread+0x119, 
> > kthread+0xbb, ret_from_fork+0x7c, 0x
> > 
> > 32220 ip  D copy_net_ns+0x68, create_new_namespaces+0xfc, 
> > unshare_nsproxy_namespaces+0x66, SyS_unshare+0x159, 
> > system_call_fastpath+0x16, 0x
> > 
> > copy_net_ns() is waiting on net_mutex which is held by cleanup_net().
> > 
> > vsftpd uses CLONE_NEWNET to set up privsep processes. There is a comment
> > about it being really slow before 2.6.35 (it avoids CLONE_NEWNET in that
> > case). I didn't find anything that makes 2.6.35 any faster, but on Debian
> > 2.6.36-5-amd64, I notice it does seem to be a bit faster than 3.2, 3.10,
> > 3.16, though still not anything I'd ever want to rely on per connection.
> > 
> > C implementation of the above: http://0x.ca/sim/ref/tools/netnsloop.c
> > 
> > Kernel stack "top": http://0x.ca/sim/ref/tools/pstack
> > 
> > What's going on here?
> 
> That is a bit slow for many configurations, but there are some exceptions.
> 
> So, what is your kernel's .config?

I was unable to find a config (or stock kernel) that was any different,
but here's the one we're using: http://0x.ca/sim/ref/3.10/config-3.10.53

How fast does the above test run for you?

We've been running with the attached, which has helped a little, but it's
still quite slow in our particular use case (vsftpd), and with the above
test. Should I enable RCU_TRACE or STALL_INFO with a low timeout or
something?

Simon-

-- >8 --
Subject: [PATCH] netns: use synchronize_rcu_expedited instead of
 synchronize_rcu

Similar to ef323088, with synchronize_rcu(), we are only able to create
and destroy about 4 or 7 net namespaces per second, which really puts a
dent in the performance of programs attempting to use CLONE_NEWNET for
privilege separation (vsftpd, chromium).
---
 net/core/net_namespace.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 85b6269..6dcb4b3 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -296,7 +296,7 @@ static void cleanup_net(struct work_struct *work)
 * This needs to be before calling the exit() notifiers, so
 * the rcu_barrier() below isn't sufficient alone.
 */
-   synchronize_rcu();
+   synchronize_rcu_expedited();
 
/* Run all of the network namespace exit methods */
list_for_each_entry_reverse(ops, &pernet_list, list)
-- 
1.7.10.4


net_ns cleanup / RCU overhead

2014-08-19 Thread Simon Kirby
Hello!

In trying to figure out what happened to a box running lots of vsftpd
since we deployed a CONFIG_NET_NS=y kernel to it, we found that the
(wall) time needed for cleanup_net() to complete, even on an idle box,
can be quite long:

#!/bin/bash

ip netns delete test >&/dev/null
while ip netns add test; do
	echo hi
	ip netns delete test
done

On my desktop and typical hosts, this prints at only around 4 or 6 per
second. While this is happening, "vmstat 1" reports 100% idle, and
there are D-state processes with stacks similar to:

30566 [kworker/u16:1] D wait_rcu_gp+0x48, synchronize_sched+0x2f, 
cleanup_net+0xdb, process_one_work+0x175, worker_thread+0x119, kthread+0xbb, 
ret_from_fork+0x7c, 0x

32220 ip  D copy_net_ns+0x68, create_new_namespaces+0xfc, 
unshare_nsproxy_namespaces+0x66, SyS_unshare+0x159, system_call_fastpath+0x16, 
0x

copy_net_ns() is waiting on net_mutex which is held by cleanup_net().

vsftpd uses CLONE_NEWNET to set up privsep processes. There is a comment
about it being really slow before 2.6.35 (it avoids CLONE_NEWNET in that
case). I didn't find anything that makes 2.6.35 any faster, but on Debian
2.6.36-5-amd64, I notice it does seem to be a bit faster than 3.2, 3.10,
3.16, though still not anything I'd ever want to rely on per connection.

C implementation of the above: http://0x.ca/sim/ref/tools/netnsloop.c

Kernel stack "top": http://0x.ca/sim/ref/tools/pstack

What's going on here?

Simon-


Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races

2013-12-05 Thread Simon Kirby
On Wed, Dec 04, 2013 at 01:14:56PM -0800, Linus Torvalds wrote:

> The lock we're moving up isn't the lock that actually protects the
> whole allocation logic (it's the lock that then protects the pipe
> contents when a pipe is *used*). So it's a useless lock, and moving it
> up is a good idea regardless (because it makes the locks only protect
> the parts they are actually *supposed* to protect).
> 
> And while extraneous lock wouldn't normally hurt, the sleeping locks
> (both mutexes and semaphores) aren't actually safe wrt de-allocation -
> they protect anything *inside* the lock, but the lock data structure
> itself is accessed racily wrt other lockers (in a way that still
> leaves the locked region protected, but not the lock itself). If you
> care about details, you can walk through my example.

Yes, this makes sense now. It was spin_unlock_mutex() on the pipe lock,
which itself had already been freed and poisoned by another CPU. This
explicit poison check also fires:

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index bf156de..ae425d0 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -159,6 +159,7 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
__ticket_unlock_slowpath(lock, prev);
} else
__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
+   WARN_ON(*(unsigned int *)&lock->tickets.head == 0x6b6b6b6c);
 }
 
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)

It warns only as often as the poison checking already did, with a stack
of warn_*, __mutex_unlock_slowpath(), mutex_unlock(), pipe_release().

Trying to prove a negative, of course, but I tested with your first fix
overnight and got no errors. Current git (with b0d8d2292160bb63de) also
looks good. I will leave it running for a few days.

Thanks for getting stuck on this one. It was educational, at least!

Simon-


Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races

2013-12-04 Thread Simon Kirby
On Tue, Dec 03, 2013 at 09:52:33AM +0100, Ingo Molnar wrote:

> Indeed: this comes from mutex->count being separate from 
> mutex->wait_lock, and this should affect every architecture that has a 
> mutex->count fast-path implemented (essentially every architecture 
> that matters).
> 
> Such bugs should also magically go away with mutex debugging enabled.

Confirmed: I ran the reproducer with CONFIG_DEBUG_MUTEXES for a few
hours, and never got a single poison overwritten notice.

> I'd expect such bugs to be more prominent with unlucky object 
> size/alignment: if mutex->count lies on a separate cache line from 
> mutex->wait_lock.
> 
> Side note: this might be a valid light weight debugging technique, we 
> could add padding between the two fields to force them into separate 
> cache lines, without slowing it down.
> 
> Simon, would you be willing to try the fairly trivial patch below? 
> Please enable CONFIG_DEBUG_MUTEX_FASTPATH=y. Does your kernel fail 
> faster that way?

I didn't see much of a change, other than that the incremented poison byte
is now further in due to the padding, and it shows up in kmalloc-256.

I also tried with Linus' udelay() suggestion, below. With this, there
were many occurrences per second.

Simon-

diff --git a/kernel/mutex.c b/kernel/mutex.c
index d24105b..f65e735 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -25,6 +25,7 @@
 #include <linux/spinlock.h>
 #include <linux/interrupt.h>
 #include <linux/debug_locks.h>
+#include <linux/delay.h>
 
 /*
  * In the DEBUG case we are using the "NULL fastpath" for mutexes,
@@ -740,6 +741,11 @@ __mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
wake_up_process(waiter->task);
}
 
+   /* udelay a bit if the spinlock isn't contended */
+   if (lock->wait_lock.rlock.raw_lock.tickets.head + 1 ==
+   lock->wait_lock.rlock.raw_lock.tickets.tail)
+   udelay(1);
+
spin_unlock_mutex(&lock->wait_lock, flags);
 }
 


Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races

2013-12-04 Thread Simon Kirby
On Tue, Dec 03, 2013 at 10:10:29AM -0800, Linus Torvalds wrote:

> On Tue, Dec 3, 2013 at 12:52 AM, Ingo Molnar  wrote:
> >
> > I'd expect such bugs to be more prominent with unlucky object
> > size/alignment: if mutex->count lies on a separate cache line from
> > mutex->wait_lock.
> 
> I doubt that makes much of a difference. It's still just "CPU cycles"
> away, and the window will be tiny unless you have multi-socket
> machines and/or are just very unlucky.
> 
> For stress-testing, it would be much better to use some hack like
> 
> /* udelay a bit if the spinlock isn't contended */
> if (mutex->wait_lock.ticket.head+1 == mutex->wait_lock.ticket.tail)
> udelay(1);
> 
> in __mutex_unlock_common() just before the spin_unlock(). Make the
> window really *big*.

I haven't had a chance yet to do much testing of the proposed race fix
and race enlarging patches, but I did write a tool to reproduce the race.
I started it running and left for dinner, and sure enough, it actually
seems to work on plain 3.12 on a Dell PowerEdge r410 w/dual E5520,
reproducing "Poison overwritten" at a rate of about once every 15 minutes
(running 6 in parallel, booted with "slub_debug").

I have no idea if actually relying on tsc alignment between cores and
sockets is a reasonable idea these days, but it seems to work. I first
used a read() on a pipe close()d by the other process to synchronize
them, but this didn't seem to work as well as busy-waiting until the
timestamp counters pass a previously-decided-upon start time.

Meanwhile, I still don't understand how moving the unlock _up_ to cover
less of the code can solve the race, but I will stare at your long
explanation more tomorrow.

Simon-
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/wait.h>

#define MAX_PIPES 450
#define FORK_CYCLES 200
#define CLOSE_CYCLES 10
#define STAT_SHIFT 6

static inline unsigned long readtsc()
{
	unsigned int low, high;
	asm volatile("rdtsc" : "=a" (low), "=d" (high));
	return low | ((unsigned long)(high) << 32);
}

static int pipefd[MAX_PIPES][2];

int main(int argc, char *argv[]) {
	unsigned long loop, race_start, miss;
	unsigned long misses = 0, races = 0;
	int i;
	pid_t pid;
	struct rlimit rlim = {
		.rlim_cur = MAX_PIPES * 2 + 96,
		.rlim_max = MAX_PIPES * 2 + 96,
	};

	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
		perror("setrlimit(RLIMIT_NOFILE)");

	for (loop = 1;;loop++) {
		/*
		 * Make a bunch of pipes
		 */
		for (i = 0;i < MAX_PIPES;i++) {
			if (pipe(pipefd[i]) == -1) {
				perror("pipe()");
				exit(EXIT_FAILURE);
			}
		}

		race_start = readtsc() + FORK_CYCLES;

		asm("":::"memory");

		pid = fork();
		if (pid == -1) {
			perror("fork()");
			exit(EXIT_FAILURE);
		}
		pid = !!pid;

		/*
		 * Close one pipe descriptor per process
		 */
		for (i = 0;i < MAX_PIPES;i++)
			close(pipefd[i][!pid]);

		for (i = 0;i < MAX_PIPES;i++) {
			/*
			 * Line up and try to close at the same time
			 */
			miss = 1;
			while (readtsc() < race_start)
				miss = 0;

			close(pipefd[i][pid]);

			misses+= miss;
			race_start+= CLOSE_CYCLES;
		}
		races+= MAX_PIPES;

		/* Print statistics once every 2^STAT_SHIFT iterations */
		if (!(loop & (~(~0UL << STAT_SHIFT))))
			fprintf(stderr, "%c %lu (%.2f%% false starts)\n",
				"CP"[pid], readtsc(), misses * 100. / races);

		if (pid)
			wait(NULL);	/* Parent */
		else
			exit(0);	/* Child */
	}
}



Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

2013-11-30 Thread Simon Kirby
On Sat, Nov 30, 2013 at 09:25:33AM -0800, Linus Torvalds wrote:

> On Sat, Nov 30, 2013 at 1:43 AM, Simon Kirby  wrote:
>
> > I turned on kmalloc-192 tracing to find what else is using it: struct
> > nfs_fh, struct bio, and struct cred. Poking around those, struct bio has
> > bi_cnt, but it is way down in the struct. struct cred has "usage", but it
> > comes first. Hmm. Nevertheless, I set:
> >
> > CONFIG_DEBUG_MUTEXES=y
> > CONFIG_DEBUG_LIST=y
> > CONFIG_DEBUG_CREDENTIALS=y
> >
> > And tried:
> >
> > diff --git a/include/linux/cred.h b/include/linux/cred.h
> > index 04421e8..2646fe9 100644
> > --- a/include/linux/cred.h
> > +++ b/include/linux/cred.h
> > @@ -205,7 +205,9 @@ static inline void validate_process_creds(void)
> >   */
> >  static inline struct cred *get_new_cred(struct cred *cred)
> >  {
> > -   atomic_inc(&cred->usage);
> > +   if (atomic_inc_return(&cred->usage) == 0x6c) {
> > +   WARN_ON(cred->uid == 0x6b);
> 
> Oh, damn, I thought you had found it, and got very excited and already
> wrote a long email about things I wanted you to try. And then I
> started looking closer...
> 
> That test is wrong. Both of those fields are 32-bit, so testing them
> against 0x6b/0x6c is bogus: you're just catching real cases. The
> reason it catches omreport is presumably because omreport runs as some
> special user that happens to have uid 107 (on my machine that happens
> to be qemu). And having a usage count of 108 isn't particularly
> strange either - creds get a lot of re-use.
> 
> So close. It *might* still be one of those cases, but it doesn't
> really sound very likely. "bi_cnt" is deep inside the struct bio, and
> "usage" is at offset 0, not offset 4. And the nfs_fh isn't very
> interesting.

*head smack* Too much 8-bit AVR coding...

Makes sense now: uid=107(nagios) gid=109(nagios) groups=109(nagios)

Well, the chances of atomic_inc legitimately incrementing from 0x6b6b6b6b
are probably pretty slim. I'll try that.
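
For anyone following along, the difference between the two tests can be sketched in user space. POISON_FREE (0x6b) is the slab poison byte from include/linux/poison.h; the function names here are hypothetical and this is only an illustration of the comparison, not the debug patch itself.

```c
#include <stdint.h>
#include <string.h>

#define POISON_FREE	0x6b	/* byte written over freed slab objects */

/* Value a 32-bit field reads as when it lies in poisoned (freed) memory. */
static uint32_t poisoned_u32(void)
{
	uint32_t v;

	memset(&v, POISON_FREE, sizeof(v));
	return v;
}

/* The test as first written: also true for a healthy count going 107 -> 108. */
static int narrow_check(uint32_t after_inc)
{
	return after_inc == 0x6c;
}

/* Full-width test: only true if the old value was four poison bytes. */
static int wide_check(uint32_t after_inc)
{
	return after_inc == poisoned_u32() + 1;
}
```

With a usage count of 108, narrow_check() fires and wide_check() does not; only an increment of 0x6b6b6b6b (giving 0x6b6b6b6c) trips the full-width test.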

Simon-


Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

2013-11-30 Thread Simon Kirby
On Tue, Nov 26, 2013 at 03:16:09PM -0800, Linus Torvalds wrote:

> On Mon, Nov 25, 2013 at 4:44 PM, Simon Kirby  wrote:
> >
> > I was hoping this or something else by 3.12 would have fixed it, so after
> > testing we deployed this everywhere and turned off the rest of the debug
> > options. I missed slub_debug on one server, though...and it just hit
> > another case of overwritten poison.
> 
> Your thing is *very* consistent, it's once more four bytes into that
> pipe-info. And it's once more that exact same "increment second word
> in the allocation" pattern.
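
As a rough sketch of how the "four bytes in" observation falls out of the poison check: the code below fills a 192-byte buffer the way a freed object is poisoned (0x6b throughout, 0xa5 in the last byte), applies a stray 32-bit increment to the second word, and reports the first byte that no longer matches. The function names are made up for the example, and the offset of 4 assumes a little-endian machine such as the x86 boxes in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POISON_FREE 0x6b /* fill byte for freed objects */
#define POISON_END  0xa5 /* final byte of a poisoned object */

/*
 * Return the offset of the first byte that no longer holds the expected
 * poison, or -1 if the object is intact.
 */
long first_bad_poison(const unsigned char *obj, size_t size)
{
	for (size_t i = 0; i < size; i++) {
		unsigned char want = (i == size - 1) ? POISON_END : POISON_FREE;

		if (obj[i] != want)
			return (long)i;
	}
	return -1;
}

/* Poison a 192-byte object, then increment its second 32-bit word. */
long demo_offset(void)
{
	unsigned char obj[192];
	unsigned int word;

	memset(obj, POISON_FREE, sizeof(obj) - 1);
	obj[sizeof(obj) - 1] = POISON_END;
	assert(first_bad_poison(obj, sizeof(obj)) == -1);

	memcpy(&word, obj + 4, sizeof(word));
	word++; /* the stray increment after free */
	memcpy(obj + 4, &word, sizeof(word));

	/* Little-endian: the low byte (now 0x6c) sits at offset 4. */
	return first_bad_poison(obj, sizeof(obj));
}
```

This matches the report, where the bad byte is at 0x...da6c and the object starts at 0x...da68.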
> 
> > Is it true that with slub_debug, aliasing of equal-sized objects is
> > turned off, and so they shouldn't be immediately side-by-side? In other
> > words, would there be similar scrawling victim chances as allocating
> > pipe_inode_info with pages instead of slabs? "slabinfo -a" is empty.
> 
> So the thing is, with slub debugging, slub shouldn't be merging
> different slab caches.
> 
> HOWEVER.
> 
> The pipe-info structure isn't using its own slab cache, it's just
> using "kmalloc()". So it by definition will merge with all other
> kmalloc() allocations of the same size (or, to be exact, of "similar
> enough size to hit the same size bucket"). In your case it's the
> 192-byte-sized bucket.
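
A hedged sketch of the bucket rounding described above. The size table is an assumption (the common x86-64 SLUB classes); the real set depends on kernel version and configuration, and kmalloc_bucket() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size classes; the real list varies with kernel version and config. */
static const size_t kmalloc_sizes[] = {
	8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Return the smallest assumed bucket that fits `size`, or 0 if none does. */
size_t kmalloc_bucket(size_t size)
{
	size_t n = sizeof(kmalloc_sizes) / sizeof(kmalloc_sizes[0]);

	for (size_t i = 0; i < n; i++)
		if (size <= kmalloc_sizes[i])
			return kmalloc_sizes[i];
	return 0;
}

/* Anything from 129 to 192 bytes shares the 192-byte bucket. */
void kmalloc_bucket_demo(void)
{
	assert(kmalloc_bucket(129) == 192);
	assert(kmalloc_bucket(192) == 192);
	assert(kmalloc_bucket(193) == 256);
}
```

So unrelated structures whose sizes fall in the same range can end up as neighbours in the same slab, which is why a stray write from one user can land in another user's freed object.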

I turned on kmalloc-192 tracing to find what else is using it: struct
nfs_fh, struct bio, and struct cred. Poking around those, struct bio has
bi_cnt, but it is way down in the struct. struct cred has "usage", but it
comes first. Hmm. Nevertheless, I set:

CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LIST=y
CONFIG_DEBUG_CREDENTIALS=y

And tried:

diff --git a/include/linux/bio.h b/include/linux/bio.h
index ec48bac..216dc43 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -168,7 +168,7 @@ static inline void *bio_data(struct bio *bio)
  * returns. and then bio would be freed memory when if (bio->bi_flags ...)
  * runs
  */
-#define bio_get(bio)   atomic_inc(&(bio)->bi_cnt)
+#define bio_get(bio)   WARN_ON(atomic_inc_return(&(bio)->bi_cnt) == 0x6c)
 
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
 /*
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 04421e8..2646fe9 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -205,7 +205,9 @@ static inline void validate_process_creds(void)
  */
 static inline struct cred *get_new_cred(struct cred *cred)
 {
-	atomic_inc(&cred->usage);
+	if (atomic_inc_return(&cred->usage) == 0x6c) {
+		WARN_ON(cred->uid == 0x6b);
+	}
 	return cred;
 }
 
On the same server, this last hunk warned fairly quickly:

[  850.303535] ------------[ cut here ]------------
[  850.312774] WARNING: CPU: 3 PID: 6169 at include/linux/cred.h:209 get_empty_filp+0x109/0x1b0()
[  850.329974] Modules linked in: ipmi_devintf aoe ipmi_si bnx2 ipmi_msghandler evdev serio_raw
[  850.346913] CPU: 3 PID: 6169 Comm: omreport Not tainted 3.12.0-hw-debug-mutexes+ #83
[  850.362374] Hardware name: Dell Inc. PowerEdge 1950/0UR033, BIOS 2.0.1 10/30/2007
[  850.377316]  0009 880428d0fd28 817f2407 88043fccf9e8
[  850.392134]   880428d0fd68 8105a537 880428d0fd58
[  850.406936]  880428d89e00 88042960f480 880428d0ff24 88042a19
[  850.421746] Call Trace:
[  850.426627]  [] dump_stack+0x46/0x58
[  850.436888]  [] warn_slowpath_common+0x87/0xb0
[  850.448878]  [] warn_slowpath_null+0x15/0x20
[  850.460523]  [] get_empty_filp+0x109/0x1b0
[  850.471818]  [] path_openat+0x43/0x660
[  850.482426]  [] ? fcntl_setlk+0x5b/0x2d0
[  850.493391]  [] do_filp_open+0x3e/0xa0
[  850.504008]  [] ? mntput_no_expire+0x44/0x130
[  850.515842]  [] ? __alloc_fd+0x42/0x110
[  850.526630]  [] do_sys_open+0x13c/0x230
[  850.537428]  [] compat_SyS_open+0x16/0x20
[  850.548579]  [] sysenter_dispatch+0x7/0x25
[  850.559888] ---[ end trace acdbea3e141dbaec ]---

All traces are the same, and all Comms are "omreport", which is from the
Dell OpenManage tools blob, executed regularly for RAID monitoring.
Running it directly does not seem to cause the warning. kern.log shows it
seems to warn every 20 minutes. No warnings from CONFIG_DEBUG_CREDENTIALS
magic checking at all.

Is there anything interesting about this tool? It is 32-bit. I can hook
path_openat() and check for the cred contents there to print the path, if
that would help.

Simon-


Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

2013-11-25 Thread Simon Kirby
On Tue, Aug 20, 2013 at 12:51:11AM -0700, Ian Applegate wrote:

> Unfortunately no boxen with CONFIG_DEBUG_MUTEXES among them. I can
> enable on a few and should have some results within the day. These
> mainly serve (quite a bit of) HTTP/S cache traffic.
> 
> On Tue, Aug 20, 2013 at 12:21 AM, Al Viro  wrote:
> > On Tue, Aug 20, 2013 at 12:17:52AM -0700, Ian Applegate wrote:
> >> We are also seeing this or a similar issue. On a fairly widespread
> >> deployment of 3.10.1 & 3.10.6 this occurred fairly consistently on the
> >> order of 36 days (combined MTBF.)
> >
> > Do you have any boxen with CONFIG_DEBUG_MUTEXES among those?  What
> > kind of setup do you have on those, BTW?

Hmm. So, we went through a few months of running with Linus' suggested
culprit-catching patch w/DEBUG_PAGE_ALLOC, but it never tripped. We also
ran with DEBUG_MUTEXES, but that never seemed to catch anything, either.

Ian, is it true that what you saw involved no btrfs? I was still guessing
this is related to btrfs, as we are only seeing this on boxes doing btrfs
rsync-snapshot backups. I don't know what else is interesting about our
workload there, since we're not doing anything exotic.

Meanwhile, with DEBUG_LIST on 3.12-rc, we found list corruption, which
Josef fixed in 93858769172c4e3678917810e9d5de360eb991cc. This missed
3.12, unfortunately, so I built a 3.12 with Josef's btrfs-next merged (to
54563d41a58be77e9bd9ef7af1ea4026cf0e7e07, which contained that fix).

I was hoping this or something else by 3.12 would have fixed it, so after
testing we deployed this everywhere and turned off the rest of the debug
options. I missed slub_debug on one server, though...and it just hit
another case of overwritten poison.

Is it true that with slub_debug, aliasing of equal-sized objects is
turned off, and so they shouldn't be immediately side-by-side? In other
words, would there be similar scrawling victim chances as allocating
pipe_inode_info with pages instead of slabs? "slabinfo -a" is empty.

[158037.526662] 
=
[158037.528014] BUG kmalloc-192 (Not tainted): Poison overwritten
[158037.528014] 
-
[158037.528014] 
[158037.528014] Disabling lock debugging due to kernel taint
[158037.528014] INFO: 0x88013af3da6c-0x88013af3da6c. First byte 0x6c instead of 0x6b
[158037.528014] INFO: Allocated in alloc_pipe_info+0x1f/0xb0 age=22 cpu=3 pid=26402
[158037.528014] __slab_alloc.constprop.63+0x35b/0x3a0
[158037.528014] kmem_cache_alloc_trace+0xab/0x110
[158037.528014] alloc_pipe_info+0x1f/0xb0
[158037.528014] create_pipe_files+0x41/0x1f0
[158037.528014] __do_pipe_flags+0x3c/0xb0   
[158037.528014] SyS_pipe2+0x1b/0xa0
[158037.528014] SyS_pipe+0xb/0x10  
[158037.528014] system_call_fastpath+0x16/0x1b
[158037.528014] INFO: Freed in free_pipe_info+0x6a/0x70 age=39 cpu=1 pid=26402
[158037.528014] __slab_free+0x2d/0x2d4
[158037.528014] kfree+0xfd/0x130
[158037.528014] free_pipe_info+0x6a/0x70
[158037.528014] pipe_release+0x94/0xf0  
[158037.528014] __fput+0xa7/0x230
[158037.528014] fput+0x9/0x10
[158037.528014] task_work_run+0x97/0xd0
[158037.528014] do_notify_resume+0x66/0x70
[158037.528014] int_signal+0x12/0x17
[158037.528014] INFO: Slab 0xea0004ebcf00 objects=31 used=29 fp=0x88013af3e080 flags=0x80004080
[158037.528014] INFO: Object 0x88013af3da68 @offset=6760 fp=0x88013af3ca28
[158037.528014] 
[158037.528014] Bytes b4 88013af3da58: 97 b8 59 02 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ..Y.....ZZZZZZZZ
[158037.528014] Object 88013af3da68: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk
[158037.528014] Object 88013af3da78: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[158037.528014] Object 88013af3da88: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[158037.528014] Object 88013af3da98: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[158037.528014] Object 88013af3daa8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[158037.528014] Object 88013af3dab8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[158037.528014] Object 88013af3dac8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[158037.528014] Object 88013af3dad8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[158037.528014] Object 88013af3dae8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[158037.528014] Object 88013af3daf8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[158037.528014] Object 88013af3db08: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[158037.528014] Object 88013af3db18: 6b 6

Re: [3.12-rc] sg_open: leaving the kernel with locks still held!

2013-10-24 Thread Simon Kirby
On Wed, Oct 23, 2013 at 10:10:47AM -0400, Douglas Gilbert wrote:

> On 13-10-23 03:44 AM, James Bottomley wrote:
> >On Tue, 2013-10-22 at 20:41 -0400, Douglas Gilbert wrote:
> >>On 13-10-22 04:56 PM, Simon Kirby wrote:
> >>>Hello!
> >>>
> >>>While trying to figure out why the request queue to sda (ext4) was
> >>>clogging up on one of our btrfs backup boxes, I noticed a megarc process
> >>>in D state, so enabled locking debugging, and got this (on 3.12-rc6):
> >>>
> >>>[  205.372823] 
> >>>[  205.372901] [ BUG: lock held when returning to user space! ]
> >>>[  205.372979] 3.12.0-rc6-hw-debug-pagealloc+ #67 Not tainted
> >>>[  205.373055] 
> >>>[  205.373132] megarc.bin/5283 is leaving the kernel with locks still held!
> >>>[  205.373212] 1 lock held by megarc.bin/5283:
> >>>[  205.373285]  #0:  (&sdp->o_sem){.+.+..}, at: [] 
> >>>sg_open+0x3a0/0x4d0
> >>>
> >>>Vaughan, it seems you touched this area last in 15b06f9a02406e, and git
> >>>tag --contains says this went in for 3.12-rc. We didn't see this on 3.11,
> >>>though I haven't tried with lockdep.
> >>>
> >>>This is caused by some of our internal RAID monitoring scripts that run
> >>>"megarc.bin -dispCfg -a0" (even though that controller isn't present on
> >>>this server -- a PowerEdge 2950 w/Perc 5).
> >>>
> >>>strace output of the program execution that causes the above message is
> >>>here: http://0x.ca/sim/ref/3.12-rc6/megarc_strace.txt
> >>
> >>This has been reported. That patch will be reverted or,
> >>if there is enough time, a fix will (or at least should)
> >>go in before the release of lk 3.12 .
> >
> >I think you've got about a week to prove you can fix it (before 3.12
> >goes final).  I'll send my current set of fixes to Linus without doing
> >anything about sg.
> 
> "prove" is a big ask, especially coming from a
> mathematician. I consider it more hacking (in the
> golf sense) on my part to tweak well-meaning patches
> to the sg driver that cause collateral damage. Further,
> I suspect Vaughan's patch was an attempt to fix
> damage left by a previous sg_open() hacker.
> 
> I have asked Simon Kirby to apply the patch:
>   http://marc.info/?l=linux-scsi&m=138237283432010&w=2
> and report if it fixes his problems. Further I have
> written three test programs to test O_EXCL handling on
> SCSI devices, two of which are in the examples directory
> of sg3_utils version 1.37 . The latest one (single
> exclusive writer, multiple readers) can be found in
> the News section of:
>http://sg.danny.cz/sg/
> These tests don't check all possibilities (e.g. random
> signals, ml error processing and detached devices) but
> they are better than nothing. And, as a side issue, they
> break bsg (cause it ignores O_EXCL) and break the block
> layer (e.g. /dev/sdb) so perhaps it should be reverted :-)

Well, this patch works for me in that I see no more lockdep warnings or
unintended consequences when running the same "megarc.bin -dispCfg -a0"
command.

Simon-


[3.12-rc] sg_open: leaving the kernel with locks still held!

2013-10-22 Thread Simon Kirby
Hello!

While trying to figure out why the request queue to sda (ext4) was
clogging up on one of our btrfs backup boxes, I noticed a megarc process
in D state, so enabled locking debugging, and got this (on 3.12-rc6):

[  205.372823] 
[  205.372901] [ BUG: lock held when returning to user space! ]
[  205.372979] 3.12.0-rc6-hw-debug-pagealloc+ #67 Not tainted
[  205.373055] 
[  205.373132] megarc.bin/5283 is leaving the kernel with locks still held!
[  205.373212] 1 lock held by megarc.bin/5283:
[  205.373285]  #0:  (&sdp->o_sem){.+.+..}, at: [] 
sg_open+0x3a0/0x4d0

Vaughan, it seems you touched this area last in 15b06f9a02406e, and git
tag --contains says this went in for 3.12-rc. We didn't see this on 3.11,
though I haven't tried with lockdep.

This is caused by some of our internal RAID monitoring scripts that run
"megarc.bin -dispCfg -a0" (even though that controller isn't present on
this server -- a PowerEdge 2950 w/Perc 5).

strace output of the program execution that causes the above message is
here: http://0x.ca/sim/ref/3.12-rc6/megarc_strace.txt

Simon-


Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

2013-09-03 Thread Simon Kirby
On Mon, Aug 19, 2013 at 04:31:38PM -0700, Simon Kirby wrote:

> On Mon, Aug 19, 2013 at 05:24:41PM -0400, Chris Mason wrote:
> 
> > Quoting Linus Torvalds (2013-08-19 17:16:36)
> > > On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter wrote:
> > > > On Mon, 19 Aug 2013, Simon Kirby wrote:
> > > >
> > > >>[... ]  The
> > > >> alloc/free traces are always the same -- always alloc_pipe_info and
> > > >> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
> > > >>
> > > >> Object 880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk
> > > >
> > > > This looks like an increment after free in the second 32 bit value of the
> > > > structure. First 32 bit value's poison is unchanged.
> > > 
> > > Ugh. If that is "struct pipe_inode_info" and I read it right, that's
> > > the "wait_lock" spinlock that is part of the mutex.
> > > 
> > > Doing a "spin_lock()" could indeed cause an increment operation. But
> > > it still sounds like a very odd case. And even for some wild pointer
> > > I'd then expect the spin_unlock to also happen, and to then increment
> > > the next byte (or word) too. More importantly, for a mutex, I'd expect
> > > the *other* fields to be corrupted too (the "waiter" field etc). That
> > > is, unless we're still spinning waiting for the mutex, but with that
> > > value we shouldn't, as far as I can see.
> > > 
> > 
> > Simon, is this box doing btrfs send/receive?  If so, it's probably where
> > this pipe is coming from.
> 
> No, not for some time (a few kernel versions ago).
> 
> > Linus' CONFIG_DEBUG_PAGE_ALLOC suggestions are going to be the fastest
> > way to find it, I can give you a patch if it'll help.
> 
> I presume it's just:
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index d2c45e1..30d5b8d 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -780,7 +780,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
>  {
>   struct pipe_inode_info *pipe;
>  
> - pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL);
> + pipe = (void *)get_zeroed_page(GFP_KERNEL);
>   if (pipe) {
>   pipe->bufs = kzalloc(sizeof(struct pipe_buffer) * PIPE_DEF_BUFFERS, GFP_KERNEL);
>   if (pipe->bufs) {
> @@ -790,7 +790,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
>   mutex_init(&pipe->mutex);
>   return pipe;
>   }
> - kfree(pipe);
> + free_page((unsigned long)pipe);
>   }
>  
>   return NULL;
> @@ -808,7 +808,7 @@ void free_pipe_info(struct pipe_inode_info *pipe)
>   if (pipe->tmp_page)
>   __free_page(pipe->tmp_page);
>   kfree(pipe->bufs);
> - kfree(pipe);
> + free_page((unsigned long)pipe);
>  }
>  
>  static struct vfsmount *pipe_mnt __read_mostly;
> 
> ...and CONFIG_DEBUG_PAGEALLOC enabled.
> 
> > It would be nice if you could trigger this on plain 3.11-rcX instead of
> > btrfs-next.
> 
> On 3.10 it was with some btrfs-next pulled in, but the 3.11-rc4 traces
> were from 3.11-rc4 with just some of our local patches:
> 
> > git diff --stat v3.11-rc4..master
>  firmware/Makefile |4 +-
>  firmware/bnx2/bnx2-mips-06-6.2.3.fw.ihex  | 5804 ++
>  firmware/bnx2/bnx2-mips-09-6.2.1b.fw.ihex | 6496 +
>  kernel/acct.c |   21 +-
>  net/sunrpc/auth.c |2 +-
>  net/sunrpc/clnt.c |   10 +
>  net/sunrpc/xprt.c |8 +-
>  7 files changed, 12335 insertions(+), 10 deletions(-)
> 
> None of them look relevant, but I'm building vanilla -rc4 with
> CONFIG_DEBUG_PAGEALLOC and the patch above.

Stock 3.11-rc4 plus the above get_zeroed_page() for pipe allocations has
been running since August 19th on a few btrfs boxes. It has been quiet
until a few days ago, when we hit this:

BUG: soft lockup - CPU#5 stuck for 22s! [btrfs-cleaner:5871]
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe serio_raw bnx2 evdev
CPU: 5 PID: 5871 Comm: btrfs-cleaner Not tainted 3.11.0-rc4-hw+ #48
Hardware name: Dell Inc. PowerEdge 2950/0NH278, BIOS 2.7.0 10/30/2010
task: 8804261117d0 ti: 8804120d8000 task.ti: 8804120d8000
RIP: 0010:[]  [] _raw_spin_unlock_irqrestore+0xc/0x20
RSP: 0018:8804120d98b8  EFLAGS: 0296

Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

2013-08-19 Thread Simon Kirby
On Mon, Aug 19, 2013 at 05:24:41PM -0400, Chris Mason wrote:

> Quoting Linus Torvalds (2013-08-19 17:16:36)
> > On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter  wrote:
> > > On Mon, 19 Aug 2013, Simon Kirby wrote:
> > >
> > >>[... ]  The
> > >> alloc/free traces are always the same -- always alloc_pipe_info and
> > >> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
> > >>
> > >> Object 880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk
> > >
> > > This looks like an increment after free in the second 32 bit value of the
> > > structure. First 32 bit value's poison is unchanged.
> > 
> > Ugh. If that is "struct pipe_inode_info" and I read it right, that's
> > the "wait_lock" spinlock that is part of the mutex.
> > 
> > Doing a "spin_lock()" could indeed cause an increment operation. But
> > it still sounds like a very odd case. And even for some wild pointer
> > I'd then expect the spin_unlock to also happen, and to then increment
> > the next byte (or word) too. More importantly, for a mutex, I'd expect
> > the *other* fields to be corrupted too (the "waiter" field etc). That
> > is, unless we're still spinning waiting for the mutex, but with that
> > value we shouldn't, as far as I can see.
> > 
> 
> Simon, is this box doing btrfs send/receive?  If so, it's probably where
> this pipe is coming from.

No, not for some time (a few kernel versions ago).

> Linus' CONFIG_DEBUG_PAGE_ALLOC suggestions are going to be the fastest
> way to find it, I can give you a patch if it'll help.

I presume it's just:

diff --git a/fs/pipe.c b/fs/pipe.c
index d2c45e1..30d5b8d 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -780,7 +780,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
 {
 	struct pipe_inode_info *pipe;
 
-	pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL);
+	pipe = (void *)get_zeroed_page(GFP_KERNEL);
 	if (pipe) {
 		pipe->bufs = kzalloc(sizeof(struct pipe_buffer) * PIPE_DEF_BUFFERS, GFP_KERNEL);
 		if (pipe->bufs) {
@@ -790,7 +790,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
 			mutex_init(&pipe->mutex);
 			return pipe;
 		}
-		kfree(pipe);
+		free_page((unsigned long)pipe);
 	}
 
 	return NULL;
@@ -808,7 +808,7 @@ void free_pipe_info(struct pipe_inode_info *pipe)
 	if (pipe->tmp_page)
 		__free_page(pipe->tmp_page);
 	kfree(pipe->bufs);
-	kfree(pipe);
+	free_page((unsigned long)pipe);
 }
 
 static struct vfsmount *pipe_mnt __read_mostly;

...and CONFIG_DEBUG_PAGEALLOC enabled.
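
The idea behind this combination, as I understand it, is that a whole page per pipe_inode_info lets page-alloc debugging unmap the page on free, so a later stray write faults at the culprit instead of silently flipping poison. A user-space analogue, offered only as a sketch: page_alloc() and page_free() are hypothetical stand-ins built on mmap()/mprotect(), and the SIGSEGV check assumes Linux behaviour.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for get_zeroed_page(): one zero-filled page of its own. */
void *page_alloc(void)
{
	long sz = sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, (size_t)sz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Stand-in for free_page() with page debugging: revoke all access. */
int page_free(void *p)
{
	return mprotect(p, (size_t)sysconf(_SC_PAGESIZE), PROT_NONE);
}

/*
 * Return 1 if incrementing the second 32-bit word of a freed page kills
 * the writer with SIGSEGV, 0 otherwise. The write happens in a child
 * process so the caller survives.
 */
int stray_write_is_caught(void)
{
	int *page = page_alloc();
	int status = 0;
	pid_t pid;

	if (!page || page_free(page) != 0)
		return 0;

	pid = fork();
	if (pid < 0)
		return 0;
	if (pid == 0) {
		((volatile int *)page)[1]++; /* use after free */
		_exit(0);                    /* reached only if nothing faulted */
	}
	if (waitpid(pid, &status, 0) != pid)
		return 0;
	return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}

void pagealloc_demo(void)
{
	assert(stray_write_is_caught());
}
```

The cost is one page per object, which is why this is a debugging aid rather than something to leave enabled.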

> It would be nice if you could trigger this on plain 3.11-rcX instead of
> btrfs-next.

On 3.10 it was with some btrfs-next pulled in, but the 3.11-rc4 traces
were from 3.11-rc4 with just some of our local patches:

> git diff --stat v3.11-rc4..master
 firmware/Makefile |4 +-
 firmware/bnx2/bnx2-mips-06-6.2.3.fw.ihex  | 5804 ++
 firmware/bnx2/bnx2-mips-09-6.2.1b.fw.ihex | 6496 +
 kernel/acct.c |   21 +-
 net/sunrpc/auth.c |2 +-
 net/sunrpc/clnt.c |   10 +
 net/sunrpc/xprt.c |8 +-
 7 files changed, 12335 insertions(+), 10 deletions(-)

None of them look relevant, but I'm building vanilla -rc4 with
CONFIG_DEBUG_PAGEALLOC and the patch above.

Simon-


Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()

2013-08-19 Thread Simon Kirby
On Sat, Jul 06, 2013 at 11:27:38AM +0300, Pekka Enberg wrote:

> On Sat, Jul 6, 2013 at 3:09 AM, Simon Kirby  wrote:
> > We saw two Oopses overnight on two separate boxes that seem possibly
> > related, but both are weird. These boxes typically run btrfs for rsync
> > snapshot backups (and usually Oops in btrfs ;), but not this time!
> > backup02 was running 3.10-rc6 plus btrfs-next at the time, and backup03
> > was running 3.10 release plus btrfs-next from yesterday. Full kern.log
> > and .config at http://0x.ca/sim/ref/3.10/
> >
> > backup02's first Oops:
> >
> > BUG: unable to handle kernel paging request at 0001
> > IP: [] kmem_cache_alloc+0x4b/0x110
> > PGD 1f54f7067 PUD 0
> > Oops:  [#1] SMP
> > Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe microcode 
> > serio_raw bnx2 evdev
> > CPU: 0 PID: 23112 Comm: ionice Not tainted 3.10.0-rc6-hw+ #46
> > Hardware name: Dell Inc. PowerEdge 2950/0NH278, BIOS 2.7.0 10/30/2010
> > task: 8802c3f08000 ti: 8801b4876000 task.ti: 8801b4876000
> > RIP: 0010:[]  [] 
> > kmem_cache_alloc+0x4b/0x110
> > RSP: 0018:8801b4877e88  EFLAGS: 00010206
> > RAX:  RBX: 8802c3f08000 RCX: 017f040e
> > RDX: 017f040d RSI: 00d0 RDI: 8107a503
> > RBP: 8801b4877ec8 R08: 00016a80 R09: 
> > R10: 7fff025fe120 R11: 0246 R12: 00d0
> > R13: 88042d8019c0 R14: 0001 R15: 7fc3588ee97f
> > FS:  () GS:88043fc0() knlGS:
> > CS:  0010 DS:  ES:  CR0: 8005003b
> > CR2: 0001 CR3: 000409d68000 CR4: 07f0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: 0ff0 DR7: 0400
> > Stack:
> >  8801b4877ed8 8112a1bc 8800985acd20 8802c3f08000
> >  0001 7fc3588ee334 7fc358af5758 7fc3588ee97f
> >  8801b4877ee8 8107a503 8801b4877ee8 ffea
> > Call Trace:
> >  [] ? __fput+0x12c/0x240
> >  [] prepare_creds+0x23/0x150
> >  [] SyS_faccessat+0x34/0x1f0
> >  [] SyS_access+0x13/0x20
> >  [] system_call_fastpath+0x16/0x1b
> > Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 
> > 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 
> > 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
> > RIP  [] kmem_cache_alloc+0x4b/0x110
> >  RSP 
> > CR2: 0001
> > ---[ end trace 744477356cd98306 ]---
> >
> > backup03's first Oops:
> >
> > BUG: unable to handle kernel paging request at 880502efc240
> > IP: [] kmem_cache_alloc+0x4b/0x110
> > PGD 1d3a067 PUD 0
> > Oops:  [#1] SMP
> > Modules linked in: aoe ipmi_devintf ipmi_msghandler bnx2 microcode 
> > serio_raw evdev
> > CPU: 6 PID: 14066 Comm: perl Not tainted 3.10.0-hw+ #2
> > Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.11.0 07/23/2012
> > task: 88040111c3b0 ti: 8803c23ae000 task.ti: 8803c23ae000
> > RIP: 0010:[]  [] 
> > kmem_cache_alloc+0x4b/0x110
> > RSP: 0018:8803c23afd90  EFLAGS: 00010282
> > RAX:  RBX: 88040111c3b0 RCX: 0002a76e
> > RDX: 0002a76d RSI: 00d0 RDI: 8107a4e3
> > RBP: 8803c23afdd0 R08: 00016a80 R09: 
> > R10: fffe R11: ffd0 R12: 00d0
> > R13: 88041d403980 R14: 880502efc240 R15: 88010e375a40
> > FS:  7f2cae496700() GS:88041f2c() knlGS:
> > CS:  0010 DS:  ES:  CR0: 8005003b
> > CR2: 880502efc240 CR3: 0001e0ced000 CR4: 07e0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: 0ff0 DR7: 0400
> > Stack:
> >  8803c23afe98 8803c23afdb8 81133811 88040111c3b0
> >  88010e375a40 01200011 7f2cae4969d0 88010e375a40
> >  8803c23afdf0 8107a4e3 81b49b80 01200011
> > Call Trace:
> >  [] ? final_putname+0x21/0x50
> >  [] prepare_creds+0x23/0x150
> >  [] copy_creds+0x31/0x160
> >  [] ? unlazy_fpu+0x9b/0xb0
> >  [] copy_process.part.49+0x239/0x1390
> >  [] ? __alloc_fd+0x42/0x100
> >  [] do_fork+0xa4/0x320
> >  [] ? __do_pipe_flags+0x77/0xb0
> >  [] ? __fd_install+0x26/0x60
> >  [] SyS_clone+0x11/0x20
> >  [] s

[3.10] Oopses in kmem_cache_allocate() via prepare_creds()

2013-07-05 Thread Simon Kirby
We saw two Oopses overnight on two separate boxes that seem possibly
related, but both are weird. These boxes typically run btrfs for rsync
snapshot backups (and usually Oops in btrfs ;), but not this time!
backup02 was running 3.10-rc6 plus btrfs-next at the time, and backup03
was running 3.10 release plus btrfs-next from yesterday. Full kern.log
and .config at http://0x.ca/sim/ref/3.10/

backup02's first Oops:

BUG: unable to handle kernel paging request at 0001
IP: [] kmem_cache_alloc+0x4b/0x110
PGD 1f54f7067 PUD 0
Oops:  [#1] SMP
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe microcode serio_raw bnx2 evdev
CPU: 0 PID: 23112 Comm: ionice Not tainted 3.10.0-rc6-hw+ #46
Hardware name: Dell Inc. PowerEdge 2950/0NH278, BIOS 2.7.0 10/30/2010
task: 8802c3f08000 ti: 8801b4876000 task.ti: 8801b4876000
RIP: 0010:[]  [] kmem_cache_alloc+0x4b/0x110
RSP: 0018:8801b4877e88  EFLAGS: 00010206
RAX:  RBX: 8802c3f08000 RCX: 017f040e
RDX: 017f040d RSI: 00d0 RDI: 8107a503
RBP: 8801b4877ec8 R08: 00016a80 R09: 
R10: 7fff025fe120 R11: 0246 R12: 00d0
R13: 88042d8019c0 R14: 0001 R15: 7fc3588ee97f
FS:  () GS:88043fc0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 0001 CR3: 000409d68000 CR4: 07f0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Stack:
 8801b4877ed8 8112a1bc 8800985acd20 8802c3f08000
 0001 7fc3588ee334 7fc358af5758 7fc3588ee97f
 8801b4877ee8 8107a503 8801b4877ee8 ffea
Call Trace:
 [] ? __fput+0x12c/0x240
 [] prepare_creds+0x23/0x150
 [] SyS_faccessat+0x34/0x1f0
 [] SyS_access+0x13/0x20
 [] system_call_fastpath+0x16/0x1b
Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
RIP  [] kmem_cache_alloc+0x4b/0x110
 RSP 
CR2: 0001
---[ end trace 744477356cd98306 ]---

backup03's first Oops:

BUG: unable to handle kernel paging request at 880502efc240
IP: [] kmem_cache_alloc+0x4b/0x110
PGD 1d3a067 PUD 0
Oops:  [#1] SMP
Modules linked in: aoe ipmi_devintf ipmi_msghandler bnx2 microcode serio_raw evdev
CPU: 6 PID: 14066 Comm: perl Not tainted 3.10.0-hw+ #2
Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.11.0 07/23/2012
task: 88040111c3b0 ti: 8803c23ae000 task.ti: 8803c23ae000
RIP: 0010:[]  [] kmem_cache_alloc+0x4b/0x110
RSP: 0018:8803c23afd90  EFLAGS: 00010282
RAX:  RBX: 88040111c3b0 RCX: 0002a76e
RDX: 0002a76d RSI: 00d0 RDI: 8107a4e3
RBP: 8803c23afdd0 R08: 00016a80 R09: 
R10: fffe R11: ffd0 R12: 00d0
R13: 88041d403980 R14: 880502efc240 R15: 88010e375a40
FS:  7f2cae496700() GS:88041f2c() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 880502efc240 CR3: 0001e0ced000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Stack:
 8803c23afe98 8803c23afdb8 81133811 88040111c3b0
 88010e375a40 01200011 7f2cae4969d0 88010e375a40
 8803c23afdf0 8107a4e3 81b49b80 01200011
Call Trace:
 [] ? final_putname+0x21/0x50
 [] prepare_creds+0x23/0x150
 [] copy_creds+0x31/0x160
 [] ? unlazy_fpu+0x9b/0xb0
 [] copy_process.part.49+0x239/0x1390
 [] ? __alloc_fd+0x42/0x100
 [] do_fork+0xa4/0x320
 [] ? __do_pipe_flags+0x77/0xb0
 [] ? __fd_install+0x26/0x60
 [] SyS_clone+0x11/0x20
 [] stub_clone+0x69/0x90
 [] ? system_call_fastpath+0x16/0x1b
Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
RIP  [] kmem_cache_alloc+0x4b/0x110
 RSP 
CR2: 880502efc240
---[ end trace 956d153150ecc57f ]---

Simon-


Re: [ 02/42] TTY: do not update atime/mtime on read/write

2013-05-02 Thread Simon Kirby
On Tue, Apr 30, 2013 at 06:41:44PM -0700, Linus Torvalds wrote:

> On Tue, Apr 30, 2013 at 5:57 PM, Linus Torvalds
>  wrote:
> >
> > Patch is whitespace-damaged and totally untested! Caveat applicator.
> 
> Ok, so it's still whitespace-damaged, but it seems to work. The
> appended has the "8 second rule" too..
> 
> Comments? Simon?

Tested -- both hunks seem to work as intended. Thanks!

Simon-

Below became b0b885657b6c8ef63a46bc9299b2a7715d19acde

>Linus
> 
> --- snip snip ---
>  drivers/tty/pty.c| 3 +++
>  drivers/tty/tty_io.c | 4 ++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
> index a62798fcc014..59bfaecc4e14 100644
> --- a/drivers/tty/pty.c
> +++ b/drivers/tty/pty.c
> @@ -681,6 +681,9 @@ static int ptmx_open(struct inode *inode, struct file 
> *filp)
> 
> nonseekable_open(inode, filp);
> 
> +   /* We refuse fsnotify events on ptmx, since it's a shared resource */
> +   filp->f_mode |= FMODE_NONOTIFY;
> +
> retval = tty_alloc_file(filp);
> if (retval)
> return retval;
> diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
> index 97ebc8c5864e..6464029e4860 100644
> --- a/drivers/tty/tty_io.c
> +++ b/drivers/tty/tty_io.c
> @@ -988,10 +988,10 @@ void start_tty(struct tty_struct *tty)
> 
>  EXPORT_SYMBOL(start_tty);
> 
> +/* We limit tty time update visibility to every 8 seconds or so. */
>  static void tty_update_time(struct timespec *time)
>  {
> -   unsigned long sec = get_seconds();
> -   sec -= sec % 60;
> +   unsigned long sec = get_seconds() & ~7;
> if ((long)(sec - time->tv_sec) > 0)
> time->tv_sec = sec;
>  }
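
To make the granularity change in the tty_io.c hunk concrete, here is a small stand-alone sketch of the two rounding rules and the "only move forward" update. The function names are invented for the example, and update_time() returns the new value instead of writing through a struct timespec pointer as the kernel code does.

```c
#include <assert.h>

/* Old rule from the hunk above: round down to the start of the minute. */
unsigned long round_minute(unsigned long sec)
{
	return sec - sec % 60;
}

/* New rule: clear the low three bits, i.e. round down to a multiple of 8. */
unsigned long round_8s(unsigned long sec)
{
	return sec & ~7UL;
}

/* Return the timestamp to store: advance it, never move it backwards. */
long update_time(long stored, unsigned long now)
{
	unsigned long sec = round_8s(now);

	if ((long)(sec - (unsigned long)stored) > 0)
		return (long)sec;
	return stored;
}

void tty_time_demo(void)
{
	assert(round_minute(130) == 120); /* up to 59 seconds stale */
	assert(round_8s(130) == 128);     /* at most 7 seconds stale */
	assert(update_time(128, 131) == 128);
	assert(update_time(128, 136) == 136);
}
```

With the 8-second rule, idle times shown by "w" stay within a few seconds of reality while individual keystrokes inside one 8-second window remain indistinguishable through the timestamp.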


Re: [ 02/42] TTY: do not update atime/mtime on read/write

2013-04-30 Thread Simon Kirby
On Mon, Apr 29, 2013 at 06:37:24PM -0700, Greg Kroah-Hartman wrote:

> On Mon, Apr 29, 2013 at 05:36:40PM -0700, Simon Kirby wrote:
> > On Mon, Apr 29, 2013 at 05:21:17PM -0700, Greg Kroah-Hartman wrote:
> > 
> > > On Mon, Apr 29, 2013 at 05:14:45PM -0700, Simon Kirby wrote:
> > > > On Mon, Apr 29, 2013 at 12:01:44PM -0700, Greg Kroah-Hartman wrote:
> > > > 
> > > > > 3.8-stable review patch.  If anyone has any objections, please let me 
> > > > > know.
> > > > 
> > > > I object. This breaks functionality I use every day (seeing who else is
> > > > working on stuff with "w").
> > > > 
> > > > Furthermore, the patch does not actually fix the hole referenced (see
> > > > ptmx-keystroke-latency.c on 
> > > > http://vladz.devzero.fr/013_ptmx-timing.php).
> > > > I can still reproduce the timing capture even with this patch applied
> > > > (in 3.9-rc8).
> > > 
> > > How?  There are no keystrokes being reported to other users, or did we
> > > miss something with this patch?
> > 
> > wget http://vladz.devzero.fr/svn/codes/PoC/ptmx-keystroke-latency.c
> > > gcc -o ptmx-keystroke-latency ptmx-keystroke-latency.c
> > ./ptmx-keystroke-latency
> > 
> > Log in to another tty, as another user. See keystroke timing. 3.9-rc8.
> > 
> > Seems like it was missed. Meanwhile, idle times in "w" do not update.
> 
> Ah, it's using inotify on the /dev/ptmx device.  Jiri, your change
> really doesn't affect that at all :(
> 
> Simon, you mention a grsec change somewhere that addresses this issue.
> Any hints on where that would be?

Yes, see Jiri's comments in the original patch (b0de59b5733d):

http://vladz.devzero.fr/013_ptmx-timing.php

The grsec patch is linked from there:

http://grsecurity.net/~spender/sidechannel.diff

Simon-


Re: [PATCH v2] TTY: fix atime/mtime regression

2013-04-30 Thread Simon Kirby
On Fri, Apr 26, 2013 at 10:02:12AM -0700, Linus Torvalds wrote:

> On Fri, Apr 26, 2013 at 4:48 AM, Jiri Slaby  wrote:
> >
> > To revert to the old behaviour while still preventing attackers from
> > guessing the password length, we update the timestamps in one-minute
> > intervals by this patch.
> 
> Thanks, applied.
> 
> And now that I see the behavior of "w", I can kind of understand why
> you picked 10s intervals. That "w" output is really really quite ugly.
> Talking about "27.00s" idle for the current terminal when we only
> update at even minutes ends up not being sensible.

Ah, so it was your suggestion to go with one minute.

I objected to the stable-backporting of this, since it was broken and
didn't actually fix the inotify path, but I care more about the time
granularity chosen here.

> Craig, background: the current git kernel (so 3.9, and these commits
> will presumably be back-ported) does not update tty timestamps very
> often, because you can use the timestamps to look at people's typing
> behavior. Initially it didn't update the timestamps AT ALL, but that
> broke the whole idle routine. Now it updates it only at minute
> boundaries, so things like "w" _work_, but the hundredth-of-a-second
> idle precision is obviously just totally random noise.
> 
> Not a biggie, I doubt I would even have noticed unless I was
> explicitly looking at that field, but

I look at this field all the time, and would really like to see seconds.
Surely anybody typing a password types it faster than 1 character per
second. Why stretch it out so much? Can we at least make it 10 seconds?
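
To put numbers on the granularity argument, here is a minimal userspace sketch of the rounding that tty_update_time() applies. The names round_down_secs() and update_time_sketch() are made up for illustration; only the "sec -= sec % N" step and the forward-only comparison come from the patch. With the clock at second 119, a 60-second interval reports 60 (59 seconds stale), while a 10-second interval reports 110 (9 seconds stale).

```c
#include <assert.h>

/* Round `sec` down to a multiple of `interval` seconds, mirroring the
 * "sec -= sec % 60" step in the patched tty_update_time(). */
unsigned long round_down_secs(unsigned long sec, unsigned long interval)
{
    assert(interval > 0);
    return sec - sec % interval;
}

/* Hypothetical stand-in for tty_update_time(): the stored timestamp only
 * ever moves forward, and only in steps of `interval` seconds. */
void update_time_sketch(long *tv_sec, unsigned long now, unsigned long interval)
{
    unsigned long sec = round_down_secs(now, interval);

    if ((long)(sec - (unsigned long)*tv_sec) > 0)
        *tv_sec = (long)sec;
}
```

The worst-case staleness seen by "w" is interval - 1 seconds, which is why the choice between 60 and 10 matters for the IDLE column.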

Simon-

---

Subject: [PATCH] TTY: increase atime/mtime update rate

37b7f3c76595 introduces an update interval for TTY atime updates, making
"w"'s IDLE column less useful than in the past. Since this is often used
for checking to see if other users are actually using the system, reduce
the time to 10 seconds.

Signed-off-by: Simon Kirby 
Cc:  # follow 37b7f3c76595e23257f61bd80b223de865
---
 drivers/tty/tty_io.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index b045268..dee88ff 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -944,7 +944,7 @@ EXPORT_SYMBOL(start_tty);
 static void tty_update_time(struct timespec *time)
 {
     unsigned long sec = get_seconds();
-   sec -= sec % 60;
+   sec -= sec % 10;
     if ((long)(sec - time->tv_sec) > 0)
         time->tv_sec = sec;
 }
-- 
1.7.10.4


Re: [ 02/42] TTY: do not update atime/mtime on read/write

2013-04-29 Thread Simon Kirby
On Mon, Apr 29, 2013 at 05:21:17PM -0700, Greg Kroah-Hartman wrote:

> On Mon, Apr 29, 2013 at 05:14:45PM -0700, Simon Kirby wrote:
> > On Mon, Apr 29, 2013 at 12:01:44PM -0700, Greg Kroah-Hartman wrote:
> > 
> > > 3.8-stable review patch.  If anyone has any objections, please let me 
> > > know.
> > 
> > I object. This breaks functionality I use every day (seeing who else is
> > working on stuff with "w").
> > 
> > Furthermore, the patch does not actually fix the hole referenced (see
> > ptmx-keystroke-latency.c on http://vladz.devzero.fr/013_ptmx-timing.php).
> > I can still reproduce the timing capture even with this patch applied
> > (in 3.9-rc8).
> 
> How?  There are no keystrokes being reported to other users, or did we
> miss something with this patch?

wget http://vladz.devzero.fr/svn/codes/PoC/ptmx-keystroke-latency.c
gcc -o ptmx-keystroke-latency ptmx-keystroke-latency.c
./ptmx-keystroke-latency

Log in to another tty, as another user. See keystroke timing. 3.9-rc8.

Seems like it was missed. Meanwhile, idle times in "w" do not update.

> > The grsec patch instead introduces another test within the inotify code
> > (is_sidechannel_device()-related bits) -- untested by me, but probably
> > more relevant.
> > 
> > Even 37b7f3c76595e23257f61bd80b223de8658617ee, the "regression fix",
> > which Linus merged in for the 3.9 release, is still a regression for me.
> 
> And I applied that one as well.

Right, so this restores updates but increases the granularity to 60
seconds. I'm complaining that this still affects my occupational
performance.

> > 60 seconds means somebody is asleep in my environment, and so is still
> > the kind of thing that just pisses me off. I'd rather revert this whole
> > thing.
> 
> Users taking a break for longer than a minute upset you?  What are you
> really trying to keep track of here?

Really? In a team environment, a person idle for 30 seconds means they've
stopped to look at something else. Now we have to wait 2 minutes to know
if this has happened or not. Now it becomes faster to interrupt somebody
to ask them if maintenance can be done, etc.

> > I'd stand maybe 1 second as maximum granularity. You could do that with
> > less code and no test.
> 
> Patch to show this?

I was thinking of just updating the seconds field of the timespec struct,
or leaving this particular part and setting sb->s_time_gran to 1,
though that would probably break other things. Since I've never looked at
this stuff before, I'm not sure I should make a patch, but I can...

Simon-


Re: [ 02/42] TTY: do not update atime/mtime on read/write

2013-04-29 Thread Simon Kirby
On Mon, Apr 29, 2013 at 12:01:44PM -0700, Greg Kroah-Hartman wrote:

> 3.8-stable review patch.  If anyone has any objections, please let me know.

I object. This breaks functionality I use every day (seeing who else is
working on stuff with "w").

Furthermore, the patch does not actually fix the hole referenced (see
ptmx-keystroke-latency.c on http://vladz.devzero.fr/013_ptmx-timing.php).
I can still reproduce the timing capture even with this patch applied
(in 3.9-rc8).

The grsec patch instead introduces another test within the inotify code
(is_sidechannel_device()-related bits) -- untested by me, but probably
more relevant.

Even 37b7f3c76595e23257f61bd80b223de8658617ee, the "regression fix",
which Linus merged in for the 3.9 release, is still a regression for me.
60 seconds means somebody is asleep in my environment, and so is still
the kind of thing that just pisses me off. I'd rather revert this whole
thing.

I'd stand maybe 1 second as maximum granularity. You could do that with
less code and no test.

"watch -n.1 ls --full-time /dev/pts/1" shows that the exposed resolution
(without inotify) is to the nanosecond.

Simon-

> --
> 
> From: Jiri Slaby 
> 
> commit b0de59b5733d18b0d1974a060860a8b5c1b36a2e upstream.
> 
> On http://vladz.devzero.fr/013_ptmx-timing.php, we can see how to find
> out length of a password using timestamps of /dev/ptmx. It is
> documented in "Timing Analysis of Keystrokes and Timing Attacks on
> SSH". To avoid that problem, do not update time when reading
> from/writing to a TTY.
> 
> I am afraid of regressions as this is a behavior we have since 0.97
> and apps may expect the time to be current, e.g. for monitoring
> whether there was a change on the TTY. Now, there is no change. So
> this would better have a lot of testing before it goes upstream.
> 
> References: CVE-2013-0160
> 
> Signed-off-by: Jiri Slaby 
> Signed-off-by: Greg Kroah-Hartman 
> 
> ---
>  drivers/tty/tty_io.c |8 ++--
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> --- a/drivers/tty/tty_io.c
> +++ b/drivers/tty/tty_io.c
> @@ -977,8 +977,7 @@ static ssize_t tty_read(struct file *fil
>      else
>          i = -EIO;
>      tty_ldisc_deref(ld);
> -    if (i > 0)
> -        inode->i_atime = current_fs_time(inode->i_sb);
> +
>      return i;
>  }
>  
> @@ -1079,11 +1078,8 @@ static inline ssize_t do_tty_write(
>              break;
>          cond_resched();
>      }
> -    if (written) {
> -        struct inode *inode = file->f_path.dentry->d_inode;
> -        inode->i_mtime = current_fs_time(inode->i_sb);
> +    if (written)
>          ret = written;
> -    }
>  out:
>      tty_write_unlock(tty);
>      return ret;
> 
> 


Re: Regression with initramfs and nfsroot (appears to be in the dcache)

2012-11-30 Thread Simon Kirby
On Fri, Nov 30, 2012 at 02:00:48AM +, Al Viro wrote:

> OK, that settles it.  WARN_ON() and printks in the area can be dropped;
> the right fix is below.  However, there's a similar place in cifs that
> also needs to be dealt with and I really, really wonder why the hell do
> we do d_drop() in nfs_revalidate_lookup().  It's not relevant in this
> bug, but I would like to understand what's wrong with simply returning
> 0 from ->d_revalidate() and letting the caller (in fs/namei.c) take care
> of unhashing, etc. itself.  Would make have_submounts() in there pointless
> as well - we could just return 0 and let d_invalidate() take care of the
> checks...  Trond?
> 
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -450,7 +450,8 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
>              nfs_refresh_inode(dentry->d_inode, entry->fattr);
>              goto out;
>          } else {
> -            d_drop(dentry);
> +            if (d_invalidate(dentry) != 0)
> +                goto out;
>              dput(dentry);
>          }
>      }

Hello,

With your previous patch (with the WARN_ON), I hit the WARN_ON() in the
test case described here: https://patchwork.kernel.org/patch/1446851/ .
The __d_move()ing mountpoint case no longer hits, and there is no longer
an EBUSY, so this seems to work for me (in 3.6, where it broke).

Simon-


run_posix_cpu_timers panic on v2.6.22-rc6

2007-07-01 Thread Simon Kirby
Having recently upgraded our Asterisk server, I figured it would be a
good time to try a NO_HZ kernel.  Everything was running well, until...
it decided to panic.
 
All I have is a fuzzy picture of the console to work from.  The panic
was a fatal exception in interrupt, with EIP within run_posix_cpu_timers.
I can't quite read the offsets, but the stack backtrace was:

run_rebalance_domains
scheduler_tick
tick_periodic
tick_handle_periodic
smp_apic_timer_interrupt
apic_timer_interrupt
default_idle
default_idle
cpu_idle
start_kernel

Seeing as this is all new code and the box has been otherwise stable for
the past 3 years, there is probably a problem still lurking in the NO_HZ
code somewhere.  But, it looks like I don't have any other info.  I'll
try to get a better shot of the Oops next time...

Simon-


Re: 3c590 vs. tulip

2001-05-11 Thread Simon Kirby

On Fri, May 11, 2001 at 09:27:29AM -0400, Dan Mann wrote:

> The server has lots (ok, about 20,000 not counting the os itself) of medium
> sized files on it, ranging in size from 60k to 40MB.  When I run gqview
> (image viewing program) on the client and point to a local directory that is
> mapped to the server using samba, the images (over 4000 in one directory)
> are displayed absolutely as fast as I can click my mouse button.  No lag
> time whatsoever.  How can this be so fast?  Even with the images on my local
> faster machine it is much slower.  Images take at least .5 to 1 second to
> load when they are stored locally.  But over the network, with 2.4.4 and
> samba 2.2, It's as if the server "knows" what I'm going to ask for before I
> actually do.  Is this normal?  I honestly don't think it was this fast when
> server was on 2.2 Kernel with samba 2.07.

Note that the newer gqviews preload the "next" image (next based on your
previous clicking direction).  If you are clicking sequentially and give
it enough time between images, it will immediately display the next image
when you click on it.

Even if it were some sort of caching bug, I don't see how gqview would
be able to load them that much faster -- it still has to decode them at
one point or another.

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



2.4.3 freeze under heavy writing + open rxvt

2001-04-03 Thread Simon Kirby

Three times now I've had 2.4.3 freeze on my dual CPU box while doing a
"dd if=/dev/zero of=/dev/hdc bs=1024k" (a drive to be RMA'd :)).  I got
bored and opened an rxvt, and as the machine was swapping in (I assume),
everything froze.  The mouse still moved for about 5 seconds before the
freeze, and the window was visible as it was attempting to start tcsh.

I'm guessing that what's happening is something is waiting on a lock and
blocking interrupts (?) for five seconds while it is swapping in, and the
NMI lockup detector is kicking in and really breaking it.

I have my serial console plugged in and minicom actually capturing now,
so I'll see if I can get a trace of some sort.

Simon-




Re: LDT allocated for cloned task!

2001-03-21 Thread Simon Kirby

On Tue, Mar 20, 2001 at 09:23:14AM -0800, Linus Torvalds wrote:

> It's harmless.
> 
> It's really a warning that says: the mm that you allocated a new LDT for
> may have multiple users, and while the LDT is added to all of them, we
> don't guarantee _when_ the other users will actually see the LDT.
> 
> It so happens that the other users are probably just something like
> "top" or similar, that increment the MM count to make sure that the MM
> doesn't go away while they get information about the process. And those
> users don't care about the LDT in the least.

xmms with the xmms-avi (or avi-xmms?) plugin reproduces the message each
and every time xmms starts up.

Simon-




Re: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))

2001-02-26 Thread Simon Kirby

On Wed, Feb 21, 2001 at 03:52:37PM -0800, David S. Miller wrote:

> There is no reason my patch should have this effect.
> 
> All of this is what appears to be a bug in Windows TCP header
> compression, if the ID field of the IPv4 header does not change then
> it drops every other packet.
> 
> The change I posted as-is, is unacceptable because it adds unnecessary
> cost to a fast path.  The final change I actually use will likely
> involve using the TCP sequence numbers to calculate an "always
> changing" ID number in the IPv4 headers to placate these broken
> windows machines.

Has such a patch gone in to the kernel yet?

Simon-




Re: 2.4 TCP(?) timeouts

2001-02-16 Thread Simon Kirby

On Fri, Feb 16, 2001 at 07:08:05PM -0500, Simon Kirby wrote:

> Hello,
> 
> Today we put 2.4.1 on our mail server after having seen it perform well on
> some other boxes.  It seems now we are receiving a few calls every hour
> from customers reporting that the server tends to hang and eventually
> time out on them when downloading mail.  All customers that have reported
> this problem so far are on a dialup connection.  Apparently the server
> will stop transmitting data (or the client seems to think so), and then
> their mail client will time out.

We recorded a trace on the mail server end to one of the customers having
the problem.  At first they closed the connection because their mail
client was set to a timeout of 1 minute, but then when they changed it to
5 seconds, it seemed to limp along further.  It seems to me just like
there's a huge amount of packet loss, but pinging the machine just after
this shows 0% loss (just occasional jumps in response time).

During this trace, when long periods of nothing went by, "netstat -tan
|grep ip" showed nothing abnormal: a 0 byte receive queue and some
data in the send queue equal to what would be retransmitted and
eventually go through two minutes later.

nmap:
Remote operating system guess: Windows 2000 Professional, Build 2128

16:26:14.738836 < client.1104 > mail.pop3: S 1263956200:1263956200(0) win 8760  (DF)
16:26:14.73 > mail.pop3 > client.1104: S 26894293:26894293(0) ack 1263956201 win 5840  (DF)
16:26:15.014145 < client.1104 > mail.pop3: . 1:1(0) ack 1 win 9112 (DF)
16:26:15.014866 > mail.pop3 > client.1104: P 1:92(91) ack 1 win 5840 (DF)
16:26:15.291998 < client.1104 > mail.pop3: P 1:16(15) ack 92 win 9021 (DF)
16:26:15.292199 > mail.pop3 > client.1104: . 92:92(0) ack 16 win 5840 (DF)
16:26:15.292305 > mail.pop3 > client.1104: P 92:115(23) ack 16 win 5840 (DF)
16:26:16.686295 > mail.pop3 > client.1104: P 92:115(23) ack 16 win 5840 (DF)
16:26:16.954563 < client.1104 > mail.pop3: P 16:30(14) ack 115 win 8998 (DF)
16:26:16.976908 > mail.pop3 > client.1104: P 115:137(22) ack 30 win 5840 (DF)
16:26:19.776322 > mail.pop3 > client.1104: P 115:137(22) ack 30 win 5840 (DF)
16:26:20.033951 < client.1104 > mail.pop3: P 30:36(6) ack 137 win 8976 (DF)
16:26:20.034063 > mail.pop3 > client.1104: P 137:149(12) ack 36 win 5840 (DF)
16:26:25.626301 > mail.pop3 > client.1104: P 137:149(12) ack 36 win 5840 (DF)
16:26:25.922151 < client.1104 > mail.pop3: P 36:42(6) ack 149 win 8964 (DF)
16:26:25.922254 > mail.pop3 > client.1104: P 149:219(70) ack 42 win 5840 (DF)
16:26:36.949499 < client.1104 > mail.pop3: P 36:42(6) ack 149 win 8964 (DF)
16:26:36.949533 > mail.pop3 > client.1104: . 219:219(0) ack 42 win 5840  (DF)
16:26:37.116302 > mail.pop3 > client.1104: P 149:219(70) ack 42 win 5840 (DF)
16:26:37.380554 < client.1104 > mail.pop3: P 42:50(8) ack 219 win 8894 (DF)
16:26:37.380645 > mail.pop3 > client.1104: . 219:219(0) ack 50 win 5840 (DF)
16:26:37.380709 > mail.pop3 > client.1104: P 219:231(12) ack 50 win 5840 (DF)
16:26:59.567440 < client.1104 > mail.pop3: P 42:50(8) ack 219 win 8894 (DF)
16:26:59.567476 > mail.pop3 > client.1104: . 231:231(0) ack 50 win 5840  (DF)
16:26:59.776301 > mail.pop3 > client.1104: P 219:231(12) ack 50 win 5840 (DF)
16:27:00.043125 < client.1104 > mail.pop3: P 50:59(9) ack 231 win 8882 (DF)
16:27:00.043186 > mail.pop3 > client.1104: . 231:231(0) ack 59 win 5840 (DF)
16:27:00.043475 > mail.pop3 > client.1104: . 231:767(536) ack 59 win 5840 (DF)
16:27:00.043491 > mail.pop3 > client.1104: P 767:1220(453) ack 59 win 5840 (DF)
16:27:44.399831 < client.1104 > mail.pop3: P 50:59(9) ack 231 win 8882 (DF)
16:27:44.399869 > mail.pop3 > client.1104: . 1220:1220(0) ack 59 win 5840  (DF)
16:27:44.836304 > mail.pop3 > client.1104: . 231:767(536) ack 59 win 5840 (DF)
16:27:45.295946 < client.1104 > mail.pop3: . 59:59(0) ack 767 win 9112 (DF)
16:27:45.296003 > mail.pop3 > client.1104: P 767:1220(453) ack 59 win 5840 (DF)
16:29:14.886322 > mail.pop3 > client.1104: P 767:1220(453) ack 59 win 5840 (DF)
16:29:15.264417 < client.1104 > mail.pop3: P 59:67(8) ack 1220 win 8659 (DF)
16:29:15.264479 > mail.pop3 > client.1104: . 1220:1220(0) ack 67 win 5840 (DF)
16:29:15.265127 > mail.pop3 > client.1104: . 1220:1756(536) ack 67 win 5840 (DF)
16:29:15.265145 > mail.pop3 > client.1104: . 1756:2292(536) ack 67 win 5840 (DF)
16:30:45.187652 < client.1104 > mail.pop3: P 59:67(8) ack 1220 win 8659 (DF)
16:30:45.187727 > mail.pop3 > client.1104: . 2292:2292(0) ack 67 win 5840  (DF)
16:31:16.326378 > mail.pop3 > client.1104: . 1220:1756(536) ack 67 win 5840 (DF)
16:31:17.513053 < client.1104 > mail.pop3: . 67:67(0) ack 1756 win 9112 (DF)
16:31:17.513129 > mail.pop3 >
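
One way to read the trace above is to measure the gap between each first transmission from the server and its retransmission. The sketch below does that for tcpdump-style timestamps; parse_ts() and gap_seconds() are hypothetical helper names. Applied to the lines above, the gaps come out at roughly 1.4, 2.8, 5.6, 11.2, 22.4, 44.8 and 89.6 seconds, then about 121 seconds. That pattern looks like a retransmission timer doubling on every timeout without being reset in between, though this is an interpretation of the trace and not something the capture proves by itself.

```c
#include <assert.h>
#include <stdio.h>

/* Parse "HH:MM:SS.ffffff" into seconds since midnight.
 * Returns -1.0 if the string does not match. */
double parse_ts(const char *ts)
{
    int h, m;
    double s;

    assert(ts != NULL);
    if (sscanf(ts, "%d:%d:%lf", &h, &m, &s) != 3)
        return -1.0;
    return h * 3600.0 + m * 60.0 + s;
}

/* Seconds between the first send of a segment and its retransmission.
 * Assumes both stamps parse and fall on the same day. */
double gap_seconds(const char *sent, const char *resent)
{
    return parse_ts(resent) - parse_ts(sent);
}
```

For example, gap_seconds("16:26:15.292305", "16:26:16.686295") covers segment 92:115 and gap_seconds("16:26:16.976908", "16:26:19.776322") covers segment 115:137; the second gap is about twice the first.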

2.4 TCP(?) timeouts

2001-02-16 Thread Simon Kirby

Hello,

Today we put 2.4.1 on our mail server after having seen it perform well on
some other boxes.  It seems now we are receiving a few calls every hour
from customers reporting that the server tends to hang and eventually
time out on them when downloading mail.  All customers that have reported
this problem so far are on a dialup connection.  Apparently the server
will stop transmitting data (or the client seems to think so), and then
their mail client will time out.

I noticed that the 2.4.1 on my desktop seems to time out SSH connections
to servers that have become unreachable in about 10 seconds or so, which
is many times faster than 2.2 which used to sit for hours before it timed
out (if at all).  I'm not sure if this is related.  I would expect the
client to attempt to retransmit some ACKs and eventually get some RSTs
back if this were the case.

Has anybody seen similar problems?  The box was previously running
2.2.19pre8 and no customers reported such problems.

We're using cucipop w/ldap on a dual PIII 800 MHz box with 1.5 GB of RAM.

Simon-




Re: LDT allocated for cloned task!

2001-02-13 Thread Simon Kirby

On Tue, Feb 13, 2001 at 06:22:26PM +, Alan Cox wrote:

> > LDT allocated for cloned task!
> > 
> > I'm seeing this message come up fairly often while running vanilla
> > 2.4.2-pre3 on my dual Celeron system.  I don't think I saw it before
> > while running 2.4.1, but I may have just missed it.
> 
> Are you running wine or dosemu ?

Actually, I've run both of them at least a few times this boot.

I think I've found what's doing it...xmms with the avi-xmms plugin will
cause the message to appear at startup even without playing anything. 
Moving the libraries out of the /usr/lib/xmms/Input directory and
starting xmms again will not produce any message.  I only just recently
downloaded this plugin which is probably why I didn't see it before.

It's also happening on my second (non-DRI) head, so it's probably not
related to that (I'll reboot and try again without any DRI modules loaded
and see).

Simon-




LDT allocated for cloned task!

2001-02-13 Thread Simon Kirby

LDT allocated for cloned task!

I'm seeing this message come up fairly often while running vanilla
2.4.2-pre3 on my dual Celeron system.  I don't think I saw it before
while running 2.4.1, but I may have just missed it.

My system has been up around two days and has 11 of these messages in the
ring buffer.

Actually, I just remembered that I'm using the mga DRI driver module
from the DRI CVS tree rather than the built-in module, so that's not part
of the official kernel...maybe that is causing the messages.

Simon-




ECN

2001-01-26 Thread Simon Kirby

On Fri, Jan 26, 2001 at 07:14:42AM -0800, David S. Miller wrote:

> Jamie Lokier writes:
>  > Does ECN provide perceived benefits to the node using it?
> 
> Yes, endpoints and intermediate routers can tell the TCP sender about
> congestion instead of TCP having to guess about it based upon observed
> packet drop.
> 
> It is a major enhancement to performance over any WAN.
> 
> The endpoint based congestion notification happens _now_ if both
> sides speak ECN.  The router based notification will be happening
> in the near future as Cisco and others deploy ECN speaking versions of
> their router software.

Hmm... Just wondering: what does TCP then do when it receives this ECN
notification?  Try harder, try less?  Or does it get a specific packet
saying "I dropped your packet", and then the sender retransmits?

I suppose I could go find the RFC...

Simon-




Re: Subtle MM bug

2001-01-09 Thread Simon Kirby

On Tue, Jan 09, 2001 at 10:47:57AM -0800, Linus Torvalds wrote:

> And this _is_ a downside, there's no question about it. There's the worry
> about the potential loss of locality, but there's also the fact that you
> effectively need a bigger swap partition with 2.4.x - never mind that
> large portions of the allocations may never be used. You still need the
> disk space for good VM behaviour.
> 
> There are always trade-offs, I think the 2.4.x tradeoff is a good one.

Hmm, perhaps you could clarify...

For boxes that rarely ever use swap with 2.2, will they now need more
swap space on 2.4 to perform well, or just boxes which don't have enough
RAM to handle everything nicely?

Lately I've been tending to make swap partitions smaller, as it
helps in the case where we have to wait for a runaway process to eat up
all of the swap space before it gets killed.  Making the swap size
smaller speeds up the time it takes for this to happen, albeit something
which isn't supposed to happen anyway.

Simon-




Re: Pentium 4 and 2.4/2.5

2000-11-09 Thread Simon Kirby

On Wed, Nov 08, 2000 at 06:47:40PM +, Alan Cox wrote:

> Ok. Issue settled. So 'rep nop' is safe. Ok that can get into the spinlocks
> for 2.2.18

Just curious... What does "rep nop" actually accomplish, anyway?

Simon-




2.2.17 toasting cache?

2000-11-01 Thread Simon Kirby

Hmm... This seems to be happening every 20 minutes or so on a mail server
here.  This box handles about 25-35 POP3 logins per second and has 1 GB
of RAM (compiled with the kernel at 1GB currently, oops). I have
2.2.18pre15+VM_global on there ready to go, but we haven't rebooted it to
that yet.

The box runs cucipop and exim and has some staff logins etc., but it
doesn't look like any processes are eating up the memory and dumping it
for a number of different reasons.

This will probably word wrap for lots of people...sorry.  "vmstat 1":

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
26 13  1   1260   5216 125304 695560   0   0   18351  829  1473  62  38   0
18 27  1   1260   2172 125304 696940   0   0   31356  910  1545  77  22   0
23 36  1   1260   2312 124692 693468   0   0   41176  970  2362  76  24   0  
27 53  2   1260 654044  34656 132704   0   0  1773  1652 3881 15430  43  31  26  
(no response for at least 30 seconds here)
 8 43 20  39528 857256  19660  17056 388 38332   985 10906 5160 32640   1  65  34
 0 51 17  39704 856308  19688  18408 560 352   40888  586   818   4   7  89  
 0 47 16  39564 854304  19748  21128 420   0   753 0  898  5054   6   9  85
 0 45 16  39144 851136  19808  24640 376   0   914 0 1158 12984   7  10  84

As you can see, it decided to throw out around 700 MB of cache.  I've
been watching top and "vmstat 1" for a while now trying to find out what
does it, but no process ever seems to be eating up memory or anything
when it happens -- it seems to just free all the memory and then the box
just goes very slowly as the RAID array is saturated while it reads back
in all of the mailboxes as people login (417 blocked cucipop processes at
one point... ouch :)).

It doesn't look like anything is slowly eating up the memory (and cache)
and then exiting, because if it were, there would be many more blocked
cucipop processes trying to read back in the mail.  It also doesn't look
like something is quickly eating it up and exiting in a single second,
because I can't even do that if I try with an optimized malloc()-and-dirty
program.

It also looks weird that it kicks out some stuff to swap _after_ all of
the memory becomes free.

This is a dual PIII 700 MHz box, the 2.2.17 kernel has no funky patches
other than one to raise the maximum number of simultaneous
processes/threads (as you can probably guess).  Hmm...it'd be interesting
to try 2.4 on there. ;)

Any ideas?

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: eepro100: card reports no resources [was VM-global...]

2000-10-31 Thread Simon Kirby

On Mon, Oct 30, 2000 at 02:23:56PM +0800, Andrey Savochkin wrote:

> > > > Oct 26 16:38:01 ns29 kernel: eth0: card reports no resources.
> > > > 
> > > let me guess: intel eepro100 or similar??
> > > Well known problem with that one. dont know if its fully fixed ... With
> > 
> > Happens here too, with 2xPPro200, 2.2.18pre17, Eepro100 and light load.
> > The network stalls for several minutes when it happens.
> > 
> > > 2.4.0-test9-pre3 it doesnt happen on my machine ...
> > 
> > What about a fix for a 2.2.x...?
> 
> The exact reason for this problem is still unknown.

We were seeing this on a firewall a week or so ago -- it was actually
coming from some sort of arp flood/loop on the uplink not being caused by
us, and the speed of the incoming arp packets would cause these messages
to occur.

We tried ifconfig up/down, warm reboot, cold reboot, power cycle, card
swapping, and the messages continued.  We replaced the card with a 3c905
and the messages stopped, but "ifconfig" showed Rx overruns at about the
same frequency as the messages used to occur.  This is probably another
way to trigger this error than what most people are seeing.

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



Re: kqueue microbenchmark results

2000-10-25 Thread Simon Kirby

On Wed, Oct 25, 2000 at 12:23:07PM -0500, Jonathan Lemon wrote:

> Consider a program which reads from point A, writes to point B.  If
> the buffer associated with B fills up, then we don't want to continue
> reading from A.
> 
> A/B may be network sockets, pipes, or ptys. 

Fine, but we can bind the event watching to the device or socket or pipe
that will clog up, right?  In which case, we'll later get a write event
(just like with select()), and then once there is some progress you can
go back to read()ing from the original descriptor.  This is even easier
than using select() because you don't have to take the descriptor out of
the read set and put it in the write set temporarily -- it will
automatically work that way.
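For comparison, the select() bookkeeping described above can be sketched
roughly as follows.  The struct and function names are made up for
illustration; the point is only that the read interest is dropped while
output is pending and restored once it drains.

```c
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical relay state: data read from `src` that `dst` has not yet
 * accepted is counted in `pending`. */
struct relay {
    int src;
    int dst;
    size_t pending;
};

/* Fill the select() sets for one relay and return the highest fd used. */
int relay_build_sets(const struct relay *r, fd_set *rset, fd_set *wset)
{
    FD_ZERO(rset);
    FD_ZERO(wset);
    if (r->pending > 0) {
        /* dst is clogged: wait for write space, stop reading src */
        FD_SET(r->dst, wset);
        return r->dst;
    }
    /* nothing queued: go back to waiting for input */
    FD_SET(r->src, rset);
    return r->src;
}
```

With a bind-once event interface this function would not be needed at all,
which is the simplification being argued for.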

> Or perhaps you receive a request to use a resource that is currently
> busy.  Does your application want to postpone the request, or read the
> data immediately, even if the request can't be serviced yet?

Assuming this "resource" has a way of waking up the process when it
unclogs, then you can go back and read the remaining data later, which is
what you would want to do anyway.

> My point is that I can easily think of several examples as to where
> this behavior may be beneficial to the application, and I use some of 
> them myself.  You can indeed get the same result by forcing each and
> every application that wants this behavior to implement their own
> tracking mechanism, but this strikes me as error-prone and places an 
> undue burden on the application programmer.

I can see that you could write it this way... I'm just trying to see if
it's really needed. :)

As I wrote in my last email to Jamie, you would need to implement a
tracking mechanism in any case to avoid DoS attacks from clients or a
case where a single client can clog up the reading from any other client. 
And you'd need to take the descriptor out of the read() set in the
select() case anyway, so I don't really see what's different.

> You can find my paper at http://people.freebsd.org/~jlemon

I'll go and read it now. :)

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



Re: kqueue microbenchmark results

2000-10-25 Thread Simon Kirby

On Wed, Oct 25, 2000 at 07:08:48PM +0200, Jamie Lokier wrote:

> Simon Kirby wrote:
> 
> > What applications would do better by postponing some of the reading? 
> > I can't think of any reason off the top of my head why an application
> > wouldn't want to read everything it can.
> 
> Pipelined server.
> 
> 1. Wait for event.
> 2. Read block
> 3. If EAGAIN, goto 1.
> 4. If next request in block is incomplete, goto 2.
> 5. Process next request in block.
> 6. Write response.
> 7. If EAGAIN, wait until output is ready for writing then goto 6.
> 8. Goto 1 or 2, your choice.
>    (Here I'd go to 2 if the last read was complete -- it avoids a
>    redundant call to poll()).
> 
> If you simply read everything you can at step 2, you'll run out of
> memory the moment someone sends you 10 requests.
> 
> This doesn't happen if you leave unread data in kernel space --
> TCP windows and all that.

Hmm, I don't understand.

What happens at "wait until output is ready for writing then goto 6"?
You mean you would stop the main loop to wait for a single client to
unclog?  Wouldn't you just do this? ->

1. Wait for event (read and write queued).  Event occurs: Incoming
   data available.
2. Read a block.
3. Process block just read: Does it contain a full request?  If not,
   queue, goto 2, munge together.  If no more data, queue beginning
   of request, if any, and goto 1.
4. Walk over available requests in block just read.  Process.
5. Attempt to write response, if any.
6. Attempted write: Did it all get out?  If not, queue waiting
   writable data and goto 1 to wait for a write event.
7. Goto 2.

Assume we got write clogged.  Some loop later:

10. Wait for event (read and write queued).  Event occurs: Write
    space available.
11. Write remaining available data.
12. Attempted write: Did it all get out?  If not, queue remaining
    writable data and goto 1 to wait for another write event.
13. Goto 2.

(If we're some sort of forwarding daemon and the receiving end
of our forward has just unclogged, we want to read any readable
data we had waiting.  Same with if we're just answering a
request, though, as the send direction could still get clogged.)

What can't you do here?  What's wrong?  Note that the write event will
let you read any remaining queued data.  If you actually stop from going
back to the main loop when you're write clogged, you will pause the
daemon and create an easy DoS problem.  There's no way around needing to
queue writable data at least.

This is how I wrote my irc daemon a while back, and it works fine with
select().  I can't see what wouldn't work with edge-triggered events
except perhaps the write() event -- I'm not sure what would be considered
"triggered", perhaps when it goes under a watermark or something.  In any
case, it should all still work assuming get_events() offers the ability
to receive "write space available" events.

You don't have to read all data if you don't want to, assuming you will
get another event later that will unclog the situation (meaning the
obstacle must also trigger an event when it is cleared).

In fact, if you did leave the read queued in a daemon using select()
before, you'd keep looping endlessly taking all CPU and never idle
because there would always be read data available.  You'd have to not
queue the descriptor into the read set and instead stick it in the write
set so that you can sleep waiting for the write set to become available,
effectively ignoring any further events on the read set until the write
unclogs.  This sounds just like what would happen if you only got one
notification (edge triggered) in the first place.

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



Re: kqueue microbenchmark results

2000-10-25 Thread Simon Kirby

On Wed, Oct 25, 2000 at 01:02:46AM -0500, Jonathan Lemon wrote:

> Yes, someone pointed me to those today.  I would suggest reading
> some of the relevant literature before embarking on a design.  My
> paper discusses some of the issues, and Mogul/Banga make some good
> points too.
> 
> While an 'edge-trigger' design is indeed simpler, I feel that it 
> ends up making the job of the application harder.  A simple example
> to illustrate the point: what if the application does not choose 
> to read all the data from an incoming packet?  The app now has to 
> implement its own state mechanism to remember that there may be pending
> data in the buffer, since it will not get another event notification
> unless another packet arrives.

What applications would do better by postponing some of the reading? 
I can't think of any reason off the top of my head why an application
wouldn't want to read everything it can.  Doing everything in smaller
chunks would increase overhead (but maybe reduce latencies very slightly
-- albeit probably not much when using a get_events()-style interface).

Isn't it probably better to keep the kernel implementation as efficient
as possible so that the majority of applications which will read (and
write) all data possible can do it as efficiently as possible?  Queueing
up the events, even as they are in the form received from the kernel, is
pretty simple for a userspace program to do, and I think it's the best
place for it.

I know nothing about any other implementations, though, and I'm speaking
mainly from the experiences I've had with coding daemons using select(). 
You mention you wrote a paper discussing this issue...Where could I find
this?

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



Re: Linux's implementation of poll() not scalable?

2000-10-25 Thread Simon Kirby

On Tue, Oct 24, 2000 at 04:12:38PM -0700, Dan Kegel wrote:

> With poll(), it was *not a bug* for the user code to drop events; with
> your proposed interface, it *is a bug* for the user code to drop events.
> I'm just emphasizing this because Simon Kirby ([EMAIL PROTECTED]) posted
> incorrectly that your interface "has the same semantics as poll from
> the event perspective".

I missed this because I've never written anything that drops or forgets
events and didn't think about it.  Most programs will read() until EOF is
returned and write() until EAGAIN is returned with non-blocking sockets.
Is there any reason to ignore events other than to slow down response to
some events in favor of others?

I don't see why this is a problem as this interface _isn't_ replacing
select or poll, so it shouldn't matter for existing programs that aren't
converted to use the new interface.

In any case, I think I would prefer that the kernel be optimized for the
common case and leave any strange processing up to userspace so that the
majority of programs which don't need this special case can run as fast
as possible.  Besides, it wouldn't be difficult for a program to stack up
a list of events, even in the same structure as it would get from the
kernel, so that it can process them later.  At least then this data would
be in swappable memory.  Heck, even from an efficiency perspective, it
would be faster for userspace to store the data as it wouldn't keep
getting it returned from a syscall each time...

Am I missing something else?

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Simon Kirby

On Tue, Oct 24, 2000 at 10:03:04AM -0700, Linus Torvalds wrote:

> Basically, with get_events(), there is a maximum of one event per "bind".
> And the memory for that is statically allocated at bind_event() time. 
>... 
> But you'd be doing so in a controlled manner: the memory use wouldn't go
> up just because there is a sudden influx of 5 packets. So it scales
> with load by virtue of simply not _caring_ about the load - it only cares
> about the number of fd's you're waiting on.

Nice.  I like this.

It would be easy for existing userspace code to start using this
interface as it has the same semantics as select/poll from the event
perspective.  But it would make things even easier, as the bind would
follow the life of the descriptor and thus wouldn't need to be "requeued"
before every get_events call, so that part of userspace code could just
be ripped out^W^W disabled and kept only for portability.

In most of the daemons I have written, I've ended up using memcpy() to
keep a non-scribbled-over copy of the fdsets around so I don't have to
walk data structures and requeue fds on every loop for select()...nasty.
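The memcpy() trick looks roughly like this.  It is only a sketch with a
hypothetical wrapper name: the master set is changed only when a descriptor
is added or removed, and select() scribbles on a scratch copy.

```c
#include <string.h>
#include <sys/select.h>

/* Hypothetical wrapper: `master` holds every fd we are interested in.
 * select() overwrites the set it is given, so hand it a scratch copy each
 * time and leave the master untouched.  Returns whatever select() returns. */
int wait_readable(int maxfd, const fd_set *master, fd_set *ready,
                  struct timeval *tmout)
{
    memcpy(ready, master, sizeof *ready);
    return select(maxfd + 1, ready, NULL, NULL, tmout);
}
```

With a bind that follows the life of the descriptor, both the master set
and the copy would go away.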

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Simon Kirby

On Mon, Oct 23, 2000 at 10:39:36PM -0700, Linus Torvalds wrote:

> Actually, forget the mmap, it's not needed.
> 
> Here's a suggested "good" interface that would certainly be easy to
> implement, and very easy to use, with none of the scalability issues that
> many interfaces have.
>...
> Basically, the perfect interface for events would be
> 
> 	struct event {
> 		unsigned long id;	/* file descriptor ID the event is on */
> 		unsigned long event;	/* bitmask of active events */
> 	};
> 
> 	int get_events(struct event * event_array, int maxnr, struct timeval *tmout);

I like. :)

However, isn't there already something like this, albeit maybe without
the ability to return multiple events at a time?  When discussing
select/poll on IRC a while ago with sct, sct said:

  Simon: You just put your sockets into O_NONBLOCK|FASYNC mode for
   SIGIO as usual.
  Simon: Then fcntl(fd, F_SETSIG, rtsignum)
  Simon: And you'll get a signal queue which passes you the fd of
   each SIGIO in turn.
  sct: easy :)
  Simon: You don't even need the overhead of a signal handler:
   instead of select(), you just do "sigwaitinfo(&siginfo, timeout)"
   and it will do a select-style IO wait, returning the fd in the
   siginfo when it's available.

(Captured from IRC on Nov 12th, 1998.)

Or does this method still have the overhead of encapsulating the events
into signals within the kernel?
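A sketch of the signal-queue approach from the IRC log, Linux only.  Note
that sigwaitinfo() takes no timeout; the variant with a timeout is
sigtimedwait(), which is what this sketch uses.  The two helper names are
hypothetical, and the F_SETSIG fallback define is an assumption for builds
where _GNU_SOURCE takes effect too late.

```c
#define _GNU_SOURCE             /* F_SETSIG is a Linux extension */
#include <fcntl.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

#ifndef F_SETSIG
#define F_SETSIG 10             /* assumption: the x86 Linux value */
#endif

/* Arrange for `fd` to queue real-time signal `sig` when I/O becomes
 * possible.  Returns 0 on success, -1 on error. */
int rtsig_watch(int fd, int sig)
{
    sigset_t set;
    int flags;

    /* Block the signal so it queues instead of running a handler. */
    sigemptyset(&set);
    sigaddset(&set, sig);
    if (sigprocmask(SIG_BLOCK, &set, NULL) < 0)
        return -1;

    if (fcntl(fd, F_SETOWN, getpid()) < 0)
        return -1;
    if (fcntl(fd, F_SETSIG, sig) < 0)
        return -1;
    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK | O_ASYNC) < 0 ? -1 : 0;
}

/* Wait up to `ms` milliseconds for a queued I/O signal.
 * Returns the ready fd, or -1 on timeout or error. */
int rtsig_next_fd(int sig, long ms)
{
    sigset_t set;
    siginfo_t info;
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };

    sigemptyset(&set);
    sigaddset(&set, sig);
    if (sigtimedwait(&set, &info, &ts) != sig)
        return -1;
    return info.si_fd;
}
```

Each call returns one descriptor, so this delivers a single event per
syscall rather than an array of them.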

Also, what is different in your above interface that prevents it from
being able to queue up too many events?  I guess the structure is only
sizeof(int) * 2 bytes per fd, so it would only take, say, 160kB for 20,000
FDs on x86, but I don't see how the other method would be significantly
different.  The kernel would have to store the queued events still,
surely...

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



Re: [2.4.0-test9-pre5] SCSI still broken, trident/mixer still broken

2000-09-21 Thread Simon Kirby

On Thu, Sep 21, 2000 at 09:39:07PM +0200, Torben Mathiasen wrote:

> Ok, small patch cooked up. Not tested, not compiled. Give
> it a try, and if it works please send it off to Linus.
> I really need to get some work done on a project...

This worked, thanks. :)

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]

> diff -ur --exclude-from=/root/torben /opt/kernel/kernels/linux/drivers/scsi/sg.c linux/drivers/scsi/sg.c
> --- /opt/kernel/kernels/linux/drivers/scsi/sg.c   Thu Sep 21 21:29:44 2000
> +++ linux/drivers/scsi/sg.c   Thu Sep 21 21:35:46 2000
> @@ -1298,18 +1298,18 @@
>  }
>  
>  #ifdef MODULE
> -
>  MODULE_PARM(def_reserved_size, "i");
>  MODULE_PARM_DESC(def_reserved_size, "size of buffer reserved for each fd");
> +#endif
>  
> -int init_module(void) {
> +static int __init init_sg(void) {
>  if (def_reserved_size >= 0)
>   sg_big_buff = def_reserved_size;
>  sg_template.module = THIS_MODULE;
>  return scsi_register_module(MODULE_SCSI_DEV, &sg_template);
>  }
>  
> -void cleanup_module( void)
> +static void __exit exit_sg( void)
>  {
>  #ifdef CONFIG_PROC_FS
>  sg_proc_cleanup();
> @@ -1324,7 +1324,9 @@
>  }
>  sg_template.dev_max = 0;
>  }
> -#endif /* MODULE */
> +
> +module_init(init_sg);
> +module_exit(exit_sg);
>  
>  
>  #if 0




Re: [2.4.0-test9-pre5] SCSI still broken, trident/mixer still broken

2000-09-21 Thread Simon Kirby

On Thu, Sep 21, 2000 at 02:34:01PM -0400, Douglas Gilbert wrote:

> I do nearly all of my testing with sg as a module.
> So this looks like (another recent) breakage.
> 
> It is beginning to look like the sg driver is not
> (properly) initialized when it is built into the
> kernel. Perhaps you could put a printk in
> sg_init() and sg_attach() to see if they are called.

Actually, I also had a printk in sg_init() and it never got
printed.  I didn't have one in sg_attach, but I can try that.

> > At one point before I followed some of the debug/logging commands listed
> > at the top of sg.c and got an Oops as well...
> 
> Seems as though I've got a lot of retesting to do.

The oops may have been the result of it not being properly initialized or
something...

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



Re: [2.4.0-test9-pre5] SCSI still broken, trident/mixer still broken

2000-09-21 Thread Simon Kirby

On Thu, Sep 21, 2000 at 01:12:27PM -0400, Douglas Gilbert wrote:

> Interesting. 'cat /proc/scsi/scsi' should show the same
> devices as 'cat /proc/scsi/sg/device_strs' [and 
> 'cat /proc/scsi/sg/devices']. If not, then the SCSI
> mid-level is not calling sg_detect() [in sg.c] for
> all new scsi devices detected by the mid-level.
> 
> The sg_detect() routine is silent for all devices that
> are "owned" by other upper level drivers (i.e. disks,
> cdroms and tapes) but outputs a line for any other
> scsi type (e.g. scanners which are scsi type 6).

I didn't fiddle with it too much, but I added a printk to sg_detect and
verified it was not getting called at all.  I notice now, however, that I
don't even have a /proc/scsi/sg.  Does that mean it's not getting
initialized at all?  CONFIG_CHR_DEV_SG=y, assuming that's what needs to
be set (config didn't change between kernel versions).

At one point before I followed some of the debug/logging commands listed
at the top of sg.c and got an Oops as well...

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



[2.4.0-test9-pre5] SCSI still broken, trident/mixer still broken

2000-09-20 Thread Simon Kirby

Hi n' stuff,

Around 2.4.0-test9-pre2 (or so, definitely in pre3) both my SCSI scanner
and trident sound card stopped being happy.  They are still both broken
in pre5.  On test8, both work perfectly.

On test8:

(scsi0:6:0) Synchronous Data Transfer Request was rejected
  Vendor:           Model: Scanner           Rev: 1.70
  Type:   Scanner                            ANSI SCSI revision: 04
Detected scsi generic sg0 at scsi0, channel 0, id 6, lun 0, type 6
(scsi1:0:3:0) Synchronous at 8.0 Mbyte/sec, offset 31.
  Vendor: YAMAHA    Model: CRW4416S          Rev: 1.0e
  Type:   CD-ROM                             ANSI SCSI revision: 02
Detected scsi CD-ROM sr0 at scsi1, channel 0, id 3, lun 0
scsi : detected 1 SCSI cdrom total.
sr0: scsi3-mmc drive: 16x/16x writer cd/rw xa/form2 cdda tray

... on test9pre5 and test9pre3:

(scsi0:6:0) Synchronous Data Transfer Request was rejected
  Vendor:           Model: Scanner           Rev: 1.70
  Type:   Scanner                            ANSI SCSI revision: 04
(scsi0:0:3:0) Synchronous at 8.0 Mbyte/sec, offset 31.
  Vendor: YAMAHA    Model: CRW4416S          Rev: 1.0e
  Type:   CD-ROM                             ANSI SCSI revision: 02
Detected scsi CD-ROM sr0 at scsi0, channel 0, id 3, lun 0
sr0: scsi3-mmc drive: 16x/16x writer cd/rw xa/form2 cdda tray

("Detected scsi generic..." line missing.)

The trident driver appears to be working, but the mixer (ac97_codec?)
appears to always keep everything muted, even though programs let the
levels be apparently adjusted.  Turning up the volume all the way on my
receiver lets me hear some very faint sound leaking through, which sounds
like a mixer problem instead of a playback problem.  An ALSA CVS snapshot
works fine.

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



2.4.0-test9-pre3 boots, 2.4.0-test9-pre4 doesn't (SCSI)

2000-09-19 Thread Simon Kirby

Hello,

2.4.0-test9-pre3 seems to boot and work fine, but 2.4.0-test9-pre4 with
the same .config doesn't.  It stops here:

agpgart: AGP aperture is 64M @ 0xe400
aha152x: processing commandline: ok
aha152x: BIOS test: passed, detected 1 controller(s)
aha152x: resetting bus...
aha152x0: vital data: rev=1, io=0x140 (0x140/0x140), irq=9, scsiid=7, 
reconnect=enabled, parity=enabled, synchronous=enabled, delay=100, extended 
translation=disabled
aha152x0: trying software interrupt, ok.
scsi0 : Adaptec 152x SCSI driver; $Revision: 2.0 $
scsi : 1 host.
(Nothing more.)

Pressing SysRq-P always gives me the same EIP, c01088ed.  System.map:

c01088c0 t default_idle
c01088f4 t poll_idle
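The EIP is resolved by picking the last symbol whose address is not above
it: c01088ed lies between the two entries, so it is default_idle+0x2d and
the sampled CPU is sitting in the idle routine.  A small sketch of that
lookup, with a hypothetical table type standing in for a parsed System.map:

```c
#include <stddef.h>

/* One System.map entry: start address and symbol name. */
struct ksym {
    unsigned long addr;
    const char *name;
};

/* Return the entry with the highest start address that is <= eip, or NULL
 * if eip lies below every entry.  `tab` must be sorted by ascending address.
 * On a match, *off receives eip's offset into that symbol. */
const struct ksym *ksym_lookup(const struct ksym *tab, size_t n,
                               unsigned long eip, unsigned long *off)
{
    const struct ksym *best = NULL;

    for (size_t i = 0; i < n && tab[i].addr <= eip; i++)
        best = &tab[i];
    if (best != NULL)
        *off = eip - best->addr;
    return best;
}
```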

This is a dual CPU machine.  Both aha152x and aic7xxx are compiled in,
but I only compiled aha152x in as of test9-pre2 as it seemed to break
when used as a module then (it would loop endlessly detecting my scanner
over and over again infinitely -- it got up to sg50something and I
rebooted).  Perhaps something else is broken that's just showing up
differently now, as the test9-pre3 to test9-pre4 diff is pretty small
and I don't see anything obviously broken.

On test9-pre3, the next lines are:

(scsi1)  found at PCI 0/6/0
(scsi1) Wide Channel, SCSI ID=7, 32/255 SCBs
(scsi1) Downloading sequencer code... 392 instructions downloaded
...etc.

.config.gz attached.

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



Re: Still ext2-corruption in test8-pre5 (incl. OOPS)

2000-09-05 Thread Simon Kirby

On Wed, Sep 06, 2000 at 02:55:29AM +0200, Udo A. Steinberg wrote:

> >>EIP; c0130400 <__block_commit_write+50/c0>   <=

Just got the same Oops with test8-pre5 while exiting mutt:

Writing /var/spool/mail/sim...Unable to handle kernel NULL pointer dereference at 
virtual address 0018
 printing eip:
c0130583
*pde = 
Oops: 
CPU:0
EIP:0010:[]
EFLAGS: 00010293
eax:    ebx:    ecx: 0800   edx: 
esi: 0800   edi: 0001   ebp:    esp: ceb19e40
ds: 0018   es: 0018   ss: 0018
Process mutt (pid: 2153, stackpage=ceb19000)
Stack: c1382a80  ce0ab000 0649  0800 c0130b52 ceb640a0
   c1382a80 09b7 1000 0dea  000b ceb640a0 ceb640a0
   09b7 c014d31e ceb6413c 006f49b7  0649 ceb640a0 ceb6413c
Call Trace: [] [] [] [] [] 
[] []
   [] [] [] []
Code: 8b 43 18 83 e0 01 0f 44 ef eb 35 89 f6 f6 43 18 10 74 2d f0

>>EIP; c0130583 <__block_commit_write+43/c0>   <=
Trace; c0130b52 
Trace; c014d31e 
Trace; c012336d 
Trace; c012156d 
Trace; c012168d 
Trace; c014189e 
Trace; c014198e 
Trace; c012c71a 
Trace; c0124608 
Trace; c012c963 
Trace; c010a65f 
Code;  c0130583 <__block_commit_write+43/c0>
 <_EIP>:
Code;  c0130583 <__block_commit_write+43/c0>   <=
   0:   8b 43 18  mov0x18(%ebx),%eax   <=
Code;  c0130586 <__block_commit_write+46/c0>
   3:   83 e0 01  and$0x1,%eax
Code;  c0130589 <__block_commit_write+49/c0>
   6:   0f 44 ef  cmove  %edi,%ebp
Code;  c013058c <__block_commit_write+4c/c0>
   9:   eb 35 jmp40 <_EIP+0x40> c01305c3 
<__block_commit_write+83/c0>
Code;  c013058e <__block_commit_write+4e/c0>
   b:   89 f6 mov%esi,%esi
Code;  c0130590 <__block_commit_write+50/c0>
   d:   f6 43 18 10   testb  $0x10,0x18(%ebx)
Code;  c0130594 <__block_commit_write+54/c0>
  11:   74 2d je 40 <_EIP+0x40> c01305c3 
<__block_commit_write+83/c0>
Code;  c0130596 <__block_commit_write+56/c0>
  13:   f0 00 00  lock add %al,(%eax)

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



[Danger] Re: test8-pre4: innd fixed?

2000-09-04 Thread Simon Kirby

On Mon, Sep 04, 2000 at 09:09:43PM -0700, Linus Torvalds wrote:

> On Mon, 4 Sep 2000, Mohammad A. Haque wrote:
> >
> > Is this file corruption 'thing' specific to innd or is it the same
> > problem reported with corrupt mailboxes with pre2 and high disk
> > activity?
> 
> The mailbox corruption thread is at least partly due to a pine bug that is
> triggered by a bugtraq posting.
> 
> The truncate issue is unrelated to that, but may certainly show up on
> mailboxes too. 

There is something definitely now even more broken with test8pre4.

I just upgraded to test8pre4 from test7 and was reading this and some
other emails with mutt.  Upon quitting mutt, mutt reported that there was
some sort of error while attempting to write the folder.  My folder now
looks like this:

<1073152 bytes of the start of original folder>
<67045376 bytes of NULL (0x00)>
<51704 bytes of the end of the original folder>

Obviously, the folder was in need of some pruning to begin with, but this
pruned a bit more than I would have liked.

I'm not exactly sure how this happened, but it definitely didn't happen
before with test7.

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]