[Bug 176902]

2011-01-10 Thread dean gaudet
yeah let's not fix this before it's a decade old.  not much longer to
wait!

https://bugs.launchpad.net/bugs/176902

Title:
  kpdf locks sound output



[Bug 102408]

2011-01-10 Thread dean gaudet
yeah let's not fix this before it's a decade old.  not much longer to
wait!

https://bugs.launchpad.net/bugs/102408

Title:
  Helper apps inherit open file descriptors



[Bug 159258]

2011-01-09 Thread dean gaudet
yeah let's not fix this before it's a decade old.  not much longer to
wait!

https://bugs.launchpad.net/bugs/159258

Title:
  Helper applications launched by Firefox inherit ALL file descriptors



Bug#506707: me too

2009-09-12 Thread dean gaudet
this is a fairly serious regression.

-dean






Bug#495820: FTBFS: make[1]: *** No rule to make target `txt'. Stop.

2008-08-20 Thread dean gaudet
Package: iproute
Version: 20080725-2

i did:

sudo apt-get build-dep iproute
apt-get source iproute
cd iproute-20080725-2
fakeroot ./debian/rules binary

and it fails:

...
/usr/share/texmf-texlive/dvips/base/texps.pro
/usr/share/texmf-texlive/dvips/base/special.pro
/usr/share/texmf-texlive/dvips/base/color.pro.
/usr/share/texmf-texlive/fonts/type1/bluesky/cm/cmsy10.pfb[1]
make[1]: *** No rule to make target `txt'.  Stop.
make[1]: Leaving directory `/var/src/iproute2/iproute-20080725/doc'
make: *** [stamp-doc] Error 2

if i remove the "txt" target from the make -C doc line in debian/rules the
build completes successfully.
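
fwiw, the workaround as a one-liner -- the sed pattern is only a guess at
the shape of the actual line in debian/rules, so double-check before using:

sed -i 's/\(-C doc.*\) txt/\1/' debian/rules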

is there some other missing build-dep which makes that work?

thanks
-dean






Bug#493635: really ignore /etc/network/options

2008-08-03 Thread dean gaudet
package: netbase
version: 4.33

spot the bug in /etc/init.d/networking:

process_options() {
    [ -e /etc/network/options ] || return 0
    log_warning_msg "/etc/network/options still exists and it will be IGNORED! Read README.Debian of netbase."
}

there should be a return 0 after the log_warning_msg... without it
/etc/init.d/networking aborts if there is a /etc/network/options file
and all hell breaks loose.
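
i.e. the fixed function should read:

process_options() {
    [ -e /etc/network/options ] || return 0
    log_warning_msg "/etc/network/options still exists and it will be IGNORED! Read README.Debian of netbase."
    return 0
}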

-dean






Re: valgrind and openssl

2008-05-20 Thread dean gaudet
On Tue, 20 May 2008, Richard Salz wrote:

> > on the other hand it may be a known plaintext attack.
>
> Using those words in this context makes it sound that you not only don't
> understand what is being discussed right here and now, but also that you
> don't understand the term you just used. Are you sure you understood,
> e.g., Ted Tso's postings in this thread? Perhaps I'm missing something,
> but can you show me something that talks about known plaintext attacks in
> the context of hashing/digests?

yes i abused the term.

the so-called uninitialized data is actually from the stack right?  an 
attacker generally controls that (i.e. earlier use of the stack probably 
includes char buf[] which is controllable).  i don't know what ordering 
the entropy is added to the PRNG, but if all the useful entropy goes in 
first then an attacker might get to control the last 1KiB passed through 
the SHA1.

yes it's unlikely given what we know today that an attacker could 
manipulate the state down to a sufficiently small number of outputs, but i 
really don't see the point of letting an attacker have that sort of 
control.

-dean


Re: valgrind and openssl

2008-05-19 Thread dean gaudet
On Thu, 15 May 2008, Geoff Thorpe wrote:

> I forgot to mention something;
>
> > On Thursday 15 May 2008 12:38:24 John Parker wrote:
> > > > It is already possible to use openssl and valgrind - just build
> > > > OpenSSL with -DPURIFY, and it is quite clean.
> > >
> > > Actually on my system, just -DPURIFY doesn't satisfy valgrind.  What
> > > I'm asking for is something that both satisfies valgrind and doesn't
> > > reduce the keyspace.
> >
> > If you're using an up-to-date version of openssl when you see this (ie. a
> > recent CVS snapshot from our website, even if it's from a stable branch for
> > compatibility reasons), then please post details. -DPURIFY exists to
> > facilitate debuggers that don't like reading uninitialised data, so if
> > that's not the case then please provide details. Note however that there
> > are a variety of gotchas that allow you to create little leaks if you're
> > not careful, and valgrind could well be complaining about those instead.
>
> Note that you should always build with no-asm if you're doing this kind of
> debug analysis. The assembly optimisations are likely to operate at
> granularities and in ways that valgrind could easily complain about. I don't
> know that this is the case, but it would certainly make sense to compare
> before posting a bug report.

you know, this is sheer stupidity.

you're suggesting that testing the no-asm code is a valid way of testing 
the assembly code?

additionally the suggestion of -DPURIFY as a way of testing the code is 
also completely broken software engineering practice.

any special case changes for testing means you're not testing the REAL 
CODE.

for example if you build -DPURIFY then you also won't get notified of 
problems with other PRNG seeds which are supposed to be providing random 
*initialized* data.  not to mention that a system compiled that way is 
insecure -- so you either have to link your binaries static (to avoid the 
danger of an insecure shared lib), or set up a chroot for testing.

in any event YOU'RE NOT TESTING THE REAL CODE.  which is to say you're 
wasting your time if you test under any of these conditions.

openssl should not be relying on uninitialized data for anything.  even if 
it doesn't matter from the point of view of the PRNG, it should be pretty 
damn clear it's horrible software engineering practice.

-dean


Re: valgrind and openssl

2008-05-19 Thread dean gaudet


On Thu, 15 May 2008, Bodo Moeller wrote:

> On Thu, May 15, 2008 at 11:41 PM, Erik de Castro Lopo
> [EMAIL PROTECTED] wrote:
> > Goetz Babin-Ebell wrote:
> >
> > > But here the use of this uninitialized data is intentional
> > > and the programmer are very well aware of what they did.
> >
> > The use of unititialized data in this case is stupid because the
> > entropy of this random data is close to zero.
>
> It may be zero, but it may be more, depending on what happened earlier
> in the program if the same memory locations have been in use before.
> This may very well include data that would be unpredictable to
> adversaries -- i.e., entropy; that's the point here.

on the other hand it may be a known plaintext attack.

what are you guys smoking?

-dean


RE: valgrind and openssl

2008-05-19 Thread dean gaudet
On Mon, 19 May 2008, David Schwartz wrote:

 
> > any special case changes for testing means you're not testing the REAL
> > CODE.
>
> You mean you're not testing *all* of the real code. That's fine, you can't
> debug everythign at once.

if you haven't tested your final production binary then you haven't tested
anything at all.

> Good luck finding people who agree with you. I've been a professional
> software developer for about 18 years and I've worked on debugging with

i've been a professional for longer than you.  big whoop.

-dean


Bug#481754: no option for specifying syslog facility

2008-05-18 Thread dean gaudet
Package: fail2ban
Version: 0.8.2-3

fail2ban 0.6 supported a syslog-facility config option which controlled 
the facility for syslog messages... 0.8.2-3 does not support this.  i had 
to edit /usr/share/fail2ban/server/server.py in order to change LOG_DAEMON 
to LOG_AUTH.
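
for the record, the edit boils down to something like this (check the
exact identifier in server.py first -- this is from memory):

sed -i 's/LOG_DAEMON/LOG_AUTH/' /usr/share/fail2ban/server/server.py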

-dean






Bug#481760: Failed none causes false triggers

2008-05-18 Thread dean gaudet
Package: fail2ban
Version: 0.8.2-3

when connecting with ssh keys, no password, sshd logs:

May 18 05:08:45 twinlark sshd[5681]: Failed none for dean from 10.1.1.1 port 37262 ssh2
May 18 05:08:45 twinlark sshd[5681]: Found matching RSA key: 
May 18 05:08:45 twinlark sshd[5681]: Found matching RSA key: 
May 18 05:08:45 twinlark sshd[5681]: Accepted publickey for dean from 10.1.1.1 port 37262 ssh2

and fail2ban considers the "Failed none" to be an attack... enough
successful logins like this and the IP is banned.  this is broken.

best fix i can see is to be more explicit about the
/etc/fail2ban/filter.d/sshd.conf filters, such as:

^%(__prefix_line)sFailed password for .* from <HOST>(?: port \d*)?(?: ssh\d*)?$
^%(__prefix_line)sFailed publickey for .* from <HOST>(?: port \d*)?(?: ssh\d*)?$
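
these can be sanity-checked against a real log with the fail2ban-regex
tool that ships with 0.8 (paths assume the stock debian layout):

fail2ban-regex /var/log/auth.log /etc/fail2ban/filter.d/sshd.conf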

-dean






Bug#479530: confirm on error

2008-05-05 Thread dean gaudet
Package: apt-listchanges
Version: 2.82

when apt-listchanges encounters an error (such as the now infamous
"database /var/lib/apt/listchanges.db failed to load" error) it continues
without confirmation even if confirm=1 is set in the config file.  i think
apt-listchanges should always ask for confirmation when confirm=1 is set.

-dean






Re: [PATCH -mm crypto] AES: x86_64 asm implementation optimization

2008-05-04 Thread dean gaudet
one of the more important details in evaluating these changes would be the 
family/model/stepping of the processors being microbenchmarked... could 
you folks include /proc/cpuinfo with the results?

also -- please drop the #define for R16 to %rsp ... it obfuscates more 
than it helps anything.

thanks
-dean

On Wed, 30 Apr 2008, Sebastian Siewior wrote:

> * Huang, Ying | 2008-04-25 11:11:17 [+0800]:
>
> > Hi, Sebastian,
> Hi Huang,
>
> sorry for the delay.
>
> > I changed the patches to group the read or write together instead of
> > interleaving. Can you help me to test these new patches? The new patches
> > is attached with the mail.
> The new results are attached.
>
> > Best Regards,
> > Huang Ying
>
> Sebastian


Re: system without RAM on node0 boot fail

2008-02-01 Thread dean gaudet
actually yeah i've seen this... in a bizarre failure situation in a system 
which physically had RAM in the boot node but it was never enumerated for 
the kernel (other nodes had RAM which was enumerated).

so technically there was boot node RAM but the kernel never saw it.

-dean

On Wed, 30 Jan 2008, Christoph Lameter wrote:

> x86 supports booting from a node without RAM?


Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module

2008-02-01 Thread dean gaudet
why do we need another kernel cpuid reading method when sched_setaffinity 
exists and cpuid is available in ring3?

-dean


Re: [PATCH] x86: add PCI IDs to k8topology_64.c II

2008-02-01 Thread dean gaudet
On Tue, 29 Jan 2008, Andi Kleen wrote:

> > SRAT is essentially just a two dimensional table with node distances.
> 
> Sorry, that was actually SLIT. SRAT is not two dimensional, but also
> relatively simple. SLIT you don't really need to implement.

yeah but i'd heartily recommend implementing SLIT too.  mind you, its
almost-universal non-existence means i've had to resort to userland
measurements to determine node distances and that won't change.  i guess i
just wanted to grumble somewhere.

-dean


Re: [rdiff-backup-users] can rdiff-backup be stopped / paused / restarted? - HOWTO?

2008-01-25 Thread dean gaudet
On Mon, 14 Jan 2008, Dave Kempe wrote:

> Lexje wrote:
> > I'm completely new to rdiff-backup.
> > I'm trying to backup a complete server over the internet. Is it possible to
> > pause, stop / restart rdiff-backup? (To free up / respect
> > bandwith limitations)
>
> You could do a Ctrl-Z and then start it again with fg
> you could use screen as well

or use kill -STOP and kill -CONT ... and pray the ssh connection isn't 
dropped.
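
concretely, assuming a single rdiff-backup process on the box:

kill -STOP $(pgrep -f rdiff-backup)    # pause
kill -CONT $(pgrep -f rdiff-backup)    # resume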

-dean




Re: Fast network file copy; "recvfile()" ?

2008-01-21 Thread dean gaudet
On Thu, 17 Jan 2008, Patrick J. LoPresti wrote:

> I need to copy large (> 100GB) files between machines on a fast
> network.  Both machines have reasonably fast disk subsystems, with
> read/write performance benchmarked at > 800 MB/sec. Using 10GigE cards
> and the usual tweaks to tcp_rmem etc., I am getting single-stream TCP
> throughput better than 600 MB/sec.
> 
> My question is how best to move the actual file.  NFS writes appear to
> max out at a little over 100 MB/sec on this configuration.

did your "usual tweaks" include mounting with -o tcp,rsize=262144,wsize=262144 ?

i should have kept better notes last time i was experimenting with this,
but from memory here's what i found:

- if i used three NFS clients and was reading from page cache on the
  server i hit 1.2GB/s total throughput from the server.  the client
  NFS code was maxing out one CPU on each of the client machines.

- disk subsystem (sw raid10 far2) was capable of 600MB/s+ when read
  locally on the NFS server, but topped out around ~250MB/s when read
  remotely (no matter how many clients).

my workload was read-intensive so i didn't experiment with writes...
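
for reference, the sort of mount invocation i mean (server and export
names are placeholders):

mount -t nfs -o tcp,rsize=262144,wsize=262144 server:/export /mnt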

-dean


Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5

2008-01-15 Thread dean gaudet
On Tue, 15 Jan 2008, Andrew Morton wrote:

> On Tue, 15 Jan 2008 21:01:17 -0800 (PST) dean gaudet <[EMAIL PROTECTED]> 
> wrote:
> 
> > On Mon, 14 Jan 2008, NeilBrown wrote:
> > 
> > > 
> > > raid5's 'make_request' function calls generic_make_request on
> > > underlying devices and if we run out of stripe heads, it could end up
> > > waiting for one of those requests to complete.
> > > This is bad as recursive calls to generic_make_request go on a queue
> > > and are not even attempted until make_request completes.
> > > 
> > > So: don't make any generic_make_request calls in raid5 make_request
> > > until all waiting has been done.  We do this by simply setting
> > > STRIPE_HANDLE instead of calling handle_stripe().
> > > 
> > > If we need more stripe_heads, raid5d will get called to process the
> > > pending stripe_heads which will call generic_make_request from a
> > > different thread where no deadlock will happen.
> > > 
> > > 
> > > This change by itself causes a performance hit.  So add a change so
> > > that raid5_activate_delayed is only called at unplug time, never in
> > > raid5.  This seems to bring back the performance numbers.  Calling it
> > > in raid5d was sometimes too soon...
> > > 
> > > Cc: "Dan Williams" <[EMAIL PROTECTED]>
> > > Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
> > 
> > probably doesn't matter, but for the record:
> > 
> > Tested-by: dean gaudet <[EMAIL PROTECTED]>
> > 
> > this time i tested with internal and external bitmaps and it survived 8h 
> > and 14h resp. under the parallel tar workload i used to reproduce the 
> > hang.
> > 
> > btw this should probably be a candidate for 2.6.22 and .23 stable.
> > 
> 
> hm, Neil said
> 
>   The first fixes a bug which could make it a candidate for 24-final. 
>   However it is a deadlock that seems to occur very rarely, and has been in
>   mainline since 2.6.22.  So letting it into one more release shouldn't be
>   a big problem.  While the fix is fairly simple, it could have some
>   unexpected consequences, so I'd rather go for the next cycle.
> 
> food fight!
> 

heheh.

it's really easy to reproduce the hang without the patch -- i could
hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
i'll try with ext3... Dan's experiences suggest it won't happen with ext3
(or is even more rare), which would explain why this is overall a
rare problem.

but it doesn't result in data loss or permanent system hangups as long
as you can become root and raise the size of the stripe cache...

so OK i agree with Neil, let's test more... food fight over! :)

-dean


Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5

2008-01-15 Thread dean gaudet
On Mon, 14 Jan 2008, NeilBrown wrote:

> 
> raid5's 'make_request' function calls generic_make_request on
> underlying devices and if we run out of stripe heads, it could end up
> waiting for one of those requests to complete.
> This is bad as recursive calls to generic_make_request go on a queue
> and are not even attempted until make_request completes.
> 
> So: don't make any generic_make_request calls in raid5 make_request
> until all waiting has been done.  We do this by simply setting
> STRIPE_HANDLE instead of calling handle_stripe().
> 
> If we need more stripe_heads, raid5d will get called to process the
> pending stripe_heads which will call generic_make_request from a
> different thread where no deadlock will happen.
> 
> 
> This change by itself causes a performance hit.  So add a change so
> that raid5_activate_delayed is only called at unplug time, never in
> raid5.  This seems to bring back the performance numbers.  Calling it
> in raid5d was sometimes too soon...
> 
> Cc: "Dan Williams" <[EMAIL PROTECTED]>
> Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

probably doesn't matter, but for the record:

Tested-by: dean gaudet <[EMAIL PROTECTED]>

this time i tested with internal and external bitmaps and it survived 8h 
and 14h resp. under the parallel tar workload i used to reproduce the 
hang.

btw this should probably be a candidate for 2.6.22 and .23 stable.

thanks
-dean


nosmp/maxcpus=0 or 1 -> TSC unstable

2008-01-12 Thread dean gaudet
if i boot an x86 64-bit 2.6.24-rc7 kernel with nosmp, maxcpus=0 or 1 it 
still disables TSC :)

Marking TSC unstable due to TSCs unsynchronized

this is an opteron 2xx box which does have two cpus and no clock-divide in 
halt or cpufreq enabled so TSC should be fine with only one cpu.

pretty sure the culprit is that num_possible_cpus() > 1, which
would mean cpu_possible_map contains the second cpu... but i'm not quite
sure what the right fix is... or perhaps this is all intended.

-dean


Re: CPA patchset

2008-01-11 Thread dean gaudet
On Fri, 11 Jan 2008, dean gaudet wrote:

> On Fri, 11 Jan 2008, Ingo Molnar wrote:
> 
> > * Andi Kleen <[EMAIL PROTECTED]> wrote:
> > 
> > > Cached requires the cache line to be read first before you can write 
> > > it.
> > 
> > nonsense, and you should know it. It is perfectly possible to construct 
> > fully written cachelines, without reading the cacheline first. MOVDQ is 
> > SSE1 so on basically in every CPU today - and it is 16 byte aligned and 
> > can generate full cacheline writes, _without_ filling in the cacheline 
> > first.
> 
> did you mean to write MOVNTPS above?

btw in case you were thinking a normal store to WB rather than a 
non-temporal store... i ran a microbenchmark streaming stores to every 16 
bytes of a 16MiB region aligned to 4096 bytes on a xeon 53xx series CPU 
(4MiB L2) + 5000X northbridge and the avg latency of MOVNTPS is 12 cycles 
whereas the avg latency of MOVAPS is 20 cycles.

the inner loop is unrolled 16 times so there are literally 4 cache lines 
worth of stores being stuffed into the store queue as fast as possible... 
and there's no coalescing for normal stores even on this modern CPU.

i'm certain i'll see the same thing on AMD... it's a very hard thing to do 
in hardware without the non-temporal hint.

-dean




Re: CPA patchset

2008-01-11 Thread dean gaudet
On Fri, 11 Jan 2008, Ingo Molnar wrote:

> * Andi Kleen <[EMAIL PROTECTED]> wrote:
> 
> > Cached requires the cache line to be read first before you can write 
> > it.
> 
> nonsense, and you should know it. It is perfectly possible to construct 
> fully written cachelines, without reading the cacheline first. MOVDQ is 
> SSE1 so on basically in every CPU today - and it is 16 byte aligned and 
> can generate full cacheline writes, _without_ filling in the cacheline 
> first.

did you mean to write MOVNTPS above?


> Bulk ops (string ops, etc.) will do full cacheline writes too, 
> without filling in the cacheline.

on intel with fast strings enabled yes.  mind you intel gives hints in
the documentation these operations don't respect coherence... and i
asked about this when they posted their memory ordering paper but got no
response.

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread dean gaudet
On Thu, 10 Jan 2008, Neil Brown wrote:

> On Wednesday January 9, [EMAIL PROTECTED] wrote:
> > On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
> > > i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
> > >
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
> > >
> > > which was Neil's change in 2.6.22 for deferring generic_make_request
> > > until there's enough stack space for it.
> >
> > Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
> > by preventing recursive calls to generic_make_request.  However the
> > following conditions can cause raid5 to hang until 'stripe_cache_size' is
> > increased:
>
> Thanks for pursuing this guys.  That explanation certainly sounds very
> credible.
>
> The generic_make_request_immed is a good way to confirm that we have
> found the bug,  but I don't like it as a long term solution, as it
> just reintroduced the problem that we were trying to solve with the
> problematic commit.
>
> As you say, we could arrange that all request submission happens in
> raid5d and I think this is the right way to proceed.  However we can
> still take some of the work into the thread that is submitting the
> IO by calling raid5d() at the end of make_request, like this.
>
> Can you test it please?  Does it seem reasonable?
>
> Thanks,
> NeilBrown
>
> Signed-off-by: Neil Brown [EMAIL PROTECTED]

it has passed 11h of the untar/diff/rm linux.tar.gz workload... that's 
pretty good evidence it works for me.  thanks!

Tested-by: dean gaudet [EMAIL PROTECTED]

 
> ### Diffstat output
>  ./drivers/md/md.c    |    2 +-
>  ./drivers/md/raid5.c |    4 +++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff .prev/drivers/md/md.c ./drivers/md/md.c
> --- .prev/drivers/md/md.c       2008-01-07 13:32:10.000000000 +1100
> +++ ./drivers/md/md.c   2008-01-10 11:08:02.000000000 +1100
> @@ -5774,7 +5774,7 @@ void md_check_recovery(mddev_t *mddev)
>         if (mddev->ro)
>                 return;
>
> -       if (signal_pending(current)) {
> +       if (current == mddev->thread->tsk && signal_pending(current)) {
>                 if (mddev->pers->sync_request) {
>                         printk(KERN_INFO "md: %s in immediate safe mode\n",
>                                mdname(mddev));
>
> diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
> --- .prev/drivers/md/raid5.c    2008-01-07 13:32:10.000000000 +1100
> +++ ./drivers/md/raid5.c        2008-01-10 11:06:54.000000000 +1100
> @@ -3432,6 +3432,7 @@ static int chunk_aligned_read(struct req
>         }
>  }
>
> +static void raid5d (mddev_t *mddev);
>
>  static int make_request(struct request_queue *q, struct bio * bi)
>  {
> @@ -3547,7 +3548,7 @@ static int make_request(struct request_q
>                 goto retry;
>         }
>         finish_wait(&conf->wait_for_overlap, &w);
> -       handle_stripe(sh, NULL);
> +       set_bit(STRIPE_HANDLE, &sh->state);
>         release_stripe(sh);
>         } else {
>                 /* cannot get stripe for read-ahead, just give-up */
> @@ -3569,6 +3570,7 @@ static int make_request(struct request_q
>                test_bit(BIO_UPTODATE, &bi->bi_flags)
>                 ? 0 : -EIO);
>         }
> +       raid5d(mddev);
>         return 0;
>  }


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread dean gaudet
On Fri, 11 Jan 2008, Neil Brown wrote:

> Thanks.
> But I suspect you didn't test it with a bitmap :-)
> I ran the mdadm test suite and it hit a problem - easy enough to fix.

damn -- i lost my bitmap 'cause it was external and i didn't have things 
set up properly to pick it up after a reboot :)

if you send an updated patch i'll give it another spin...

-dean


Re: Raid 1, can't get the second disk added back in.

2008-01-09 Thread dean gaudet
On Tue, 8 Jan 2008, Bill Davidsen wrote:

> Neil Brown wrote:
> > On Monday January 7, [EMAIL PROTECTED] wrote:
> >
> > > Problem is not raid, or at least not obviously raid related.  The problem
> > > is that the whole disk, /dev/hdb is unavailable.
> >
> > Maybe check /sys/block/hdb/holders ?  lsof /dev/hdb ?
> >
> > good luck :-)
>
> losetup -a may help, lsof doesn't seem to show files used in loop mounts. Yes,
> long shot...

and don't forget dmsetup ls... (followed immediately by apt-get remove 
evms if you're on an unfortunate version of ubuntu which helpfully 
installed that partition-stealing service for you.)

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-30 Thread dean gaudet
On Sat, 29 Dec 2007, Dan Williams wrote:

> On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote:
> > On Sat, 29 Dec 2007, Dan Williams wrote:
> >
> > > On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
> > > > hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
> > > > the same 64k chunk array and had raised the stripe_cache_size to 1024...
> > > > and got a hang.  this time i grabbed stripe_cache_active before bumping
> > > > the size again -- it was only 905 active.  as i recall the bug we were
> > > > debugging a year+ ago the active was at the size when it would hang.  so
> > > > this is probably something new.
> > >
> > > I believe I am seeing the same issue and am trying to track down
> > > whether XFS is doing something unexpected, i.e. I have not been able
> > > to reproduce the problem with EXT3.  MD tries to increase throughput
> > > by letting some stripe work build up in batches.  It looks like every
> > > time your system has hung it has been in the 'inactive_blocked' state
> > > i.e. > 3/4 of stripes active.  This state should automatically
> > > clear...
> >
> > cool, glad you can reproduce it :)
> >
> > i have a bit more data... i'm seeing the same problem on debian's
> > 2.6.22-3-amd64 kernel, so it's not new in 2.6.24.
>
> This is just brainstorming at this point, but it looks like xfs can
> submit more requests in the bi_end_io path such that it can lock
> itself out of the RAID array.  The sequence that concerns me is:
>
> return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request->hang
>
> I need verify whether this path is actually triggering, but if we are
> in an inactive_blocked condition this new request will be put on a
> wait queue and we'll never get to the release_stripe() call after
> return_io().  It would be interesting to see if this is new XFS
> behavior in recent kernels.


i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1

which was Neil's change in 2.6.22 for deferring generic_make_request
until there's enough stack space for it.

with my git tree sync'd to that commit my test cases fail in under 20
minutes uptime (i rebooted and tested 3x).  sync'd to the commit previous
to it i've got 8h of run-time now without the problem.

this isn't definitive of course since it does seem to be timing
dependent, but since all failures have occured much earlier than that
for me so far i think this indicates this change is either the cause of
the problem or exacerbates an existing raid5 problem.

given that this problem looks like a very rare problem i saw with 2.6.18
(raid5+xfs there too) i'm thinking Neil's commit may just exacerbate an
existing problem... not that i have evidence either way.

i've attached a new kernel log with a hang at d89d87965d... and the
reduced config file i was using for the bisect.  hopefully the hang
looks the same as what we were seeing at 2.6.24-rc6.  let me know.

-dean

[attachment: kern.log.d89d87965d.bz2 -- binary data]

[attachment: config-2.6.21-b1.bz2 -- binary data]


Re: [patch] improve stripe_cache_size documentation

2007-12-30 Thread dean gaudet
On Sun, 30 Dec 2007, Thiemo Nagel wrote:

> > stripe_cache_size  (currently raid5 only)
>
> As far as I have understood, it applies to raid6, too.

good point... and raid4.

here's an updated patch.

-dean

Signed-off-by: dean gaudet [EMAIL PROTECTED]

Index: linux/Documentation/md.txt
===
--- linux.orig/Documentation/md.txt 2007-12-29 13:01:25.0 -0800
+++ linux/Documentation/md.txt  2007-12-30 10:16:58.0 -0800
@@ -435,8 +435,14 @@
 
 These currently include
 
-  stripe_cache_size  (currently raid5 only)
+  stripe_cache_size  (raid4, raid5 and raid6)
   number of entries in the stripe cache.  This is writable, but
   there are upper and lower limits (32768, 16).  Default is 128.
-  strip_cache_active (currently raid5 only)
+
+  The stripe cache memory is locked down and not available for other uses.
+  The total size of the stripe cache is determined by this formula:
+
+PAGE_SIZE * raid_disks * stripe_cache_size
+
+  strip_cache_active (raid4, raid5 and raid6)
   number of active entries in the stripe cache


Re: [patch] improve stripe_cache_size documentation

2007-12-30 Thread dean gaudet
On Sun, 30 Dec 2007, dean gaudet wrote:

> On Sun, 30 Dec 2007, Thiemo Nagel wrote:
>
> > > stripe_cache_size  (currently raid5 only)
> >
> > As far as I have understood, it applies to raid6, too.
>
> good point... and raid4.
>
> here's an updated patch.

and once again with a typo fix.  oops.

-dean

Signed-off-by: dean gaudet [EMAIL PROTECTED]

Index: linux/Documentation/md.txt
===
--- linux.orig/Documentation/md.txt 2007-12-29 13:01:25.0 -0800
+++ linux/Documentation/md.txt  2007-12-30 14:30:40.0 -0800
@@ -435,8 +435,14 @@
 
 These currently include
 
-  stripe_cache_size  (currently raid5 only)
+  stripe_cache_size  (raid4, raid5 and raid6)
   number of entries in the stripe cache.  This is writable, but
   there are upper and lower limits (32768, 16).  Default is 128.
-  strip_cache_active (currently raid5 only)
+
+  The stripe cache memory is locked down and not available for other uses.
+  The total size of the stripe cache is determined by this formula:
+
+PAGE_SIZE * raid_disks * stripe_cache_size
+
+  stripe_cache_active (raid4, raid5 and raid6)
   number of active entries in the stripe cache
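
to make the formula concrete: assuming 4KiB pages, an 8-disk array and
stripe_cache_size=1024, that comes to

echo $((4096 * 8 * 1024))    # 33554432 bytes, i.e. 32MiB locked down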



Re: RFC: permit link(2) to work across --bind mounts ?

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, [EMAIL PROTECTED] wrote:

> On Sat, 29 Dec 2007 12:40:47 PST, dean gaudet said:
> 
> > the main worry i have is some user maliciously hardlinks everything
> > under /var/log somewhere else and slowly fills up the file system with
> > old rotated logs.
> 
> "Doctor, it hurts when I do this.." "Well, don't do that then".

actually it doesn't hurt.  i have other mechanisms which would pick this 
up fairly quickly.

-dean


Re: RFC: permit link(2) to work across --bind mounts ?

2007-12-29 Thread dean gaudet


On Sun, 30 Dec 2007, David Newall wrote:

> dean gaudet wrote:
> > > Pffuff.  That's what volume managers are for!  You do have (at least) two
> > > independent spindles in your RAID1 array, which give you less need to
> > > worry
> > > about head-stack contention.
> > > 
> > 
> > this system is write intensive and writes go to all spindles, so your
> > assertion is wrong.
> 
> I don't know what you think I was asserting, but you were wrong.  Of course
> I/O is distributed across both spindles.  You would expect no less.  THAT is
> what I was telling you.

are you on crack?

it's a raid1.  writes go to all spindles.  they have to.  by definition.  
reads can be spread around, but writes are mirrored.

> 
> > the main worry i have is some user maliciously hardlinks everything
> > under /var/log somewhere else and slowly fills up the file system with
> > old rotated logs.  the users otherwise have quotas so they can't fill
> > things up on their own.  i could probably set up XFS quota trees (aka
> > "projects") but haven't gone to this effort yet.
> >   
> 
> See, this is where you show that you don't understand the system.  I'll
> explain it, just once.  /var/home contains  home directories.  /var/log and
> /var/home are on the same filesystem.  So /var/log/* can be linked to
> /var/home/malicious, and that's just one of your basic misunderstandings.

yes you are on crack.

i told you i understand this exactly.  it's right there in the message 
sent.

> No.  Look, you obviously haven't read what I've told you.  I mean, it's very
> obvious you haven't.  I'm wasting my time on you and I'm now out of
> generosity.  Good luck to you.  I think you need it.

you're the idiot not actually reading my messages.

-dean


Re: RFC: permit link(2) to work across --bind mounts ?

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, David Newall wrote:

> dean gaudet wrote:
> > On Wed, 19 Dec 2007, David Newall wrote:
> >   
> > > Mark Lord wrote:
> > > 
> > > > But.. pity there's no mount flag override for smaller systems,
> > > > where bind mounts might be more useful with link(2) actually working.
> > > >   
> > > I don't see it.  You always can make hard link on the underlying
> > > filesystem.
> > > If you need to make it on the bound mount, that is, if you can't locate
> > > the
> > > underlying filesystem to make the hard link, you can use a symbolic link.
> > > 
> > 
> > i run into it on a system where /home is a bind mount of /var/home ... i did
> > this because:
> > 
> > - i prefer /home to be nosuid,nodev (multi-user system)
> >   
> 
> Whatever security /home has, /var/home is the one that restricts because users
> can still access their files that way.

yep.  and /var is nosuid,nodev as well.

> > - i prefer /home to not be on same fs as /
> > - the system has only one raid1 array, and i can't stand having two
> > writable filesystems competing on the same set of spindles (i like to
> >   imagine that one fs competing for the spindles can potentially result
> >   in better seek patterns)
> > ...
> > - i didn't want to try to balance disk space between /var and /home
> > - i didn't want to use a volume mgr just to handle disk space balance...
> >   
> 
> Pffuff.  That's what volume managers are for!  You do have (at least) two
> independent spindles in your RAID1 array, which give you less need to worry
> about head-stack contention.

this system is write intensive and writes go to all spindles, so your
assertion is wrong.  a quick look at iostat shows the system has averaged
50/50 reads/writes over 34 days.  that means 50% of the IO is going to
both spindles.

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        1.96    2.24  33.65  33.16  755.50  465.45     36.55      0.56   8.43   5.98  39.96

> You probably want different mount restrictions
> on /home than /var, so you really must use separate filesystems.

not sure why you think i want different restrictions... i'm running fine
with nosuid,nodev for /var.

the main worry i have is some user maliciously hardlinks everything
under /var/log somewhere else and slowly fills up the file system with
old rotated logs.  the users otherwise have quotas so they can't fill
things up on their own.  i could probably set up XFS quota trees (aka
"projects") but haven't gone to this effort yet.


> LVM is your friend.

i disagree.  but this is getting into personal taste -- i find volume
managers to be an unnecessary layer of complexity.  given i need quotas for
the users anyhow i don't see why i should both manage my disk space via
quotas and via an extra block layer.


> 
> But with regards to bind mounts and hard links:  If you want to be able to
> hard-link /home/me/log to /var/tmp/my-log, then I see nothing to prevent
> hard-linking /var/home/me/log to /var/tmp/my-log.

you probably missed the point where i said that i was surprised i couldn't
hardlink across the bind mount and actually wanted it to work.

-dean
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on 
the same 64k chunk array and had raised the stripe_cache_size to 1024... 
and got a hang.  this time i grabbed stripe_cache_active before bumping 
the size again -- it was only 905 active.  as i recall the bug we were 
debugging a year+ ago the active was at the size when it would hang.  so 
this is probably something new.

anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to 
hit that limit too if i try harder :)
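
(for anyone reproducing this: the knobs live in sysfs, md2 being this
particular array:

    echo 2048 > /sys/block/md2/md/stripe_cache_size
    cat /sys/block/md2/md/stripe_cache_active
)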

btw what units are stripe_cache_size/active in?  is the memory consumed 
equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size * 
raid_disks * stripe_cache_active)?

-dean

On Thu, 27 Dec 2007, dean gaudet wrote:

> hmm this seems more serious... i just ran into it with chunksize 64KiB and 
> while just untarring a bunch of linux kernels in parallel... increasing 
> stripe_cache_size did the trick again.
> 
> -dean
> 
> On Thu, 27 Dec 2007, dean gaudet wrote:
> 
> > hey neil -- remember that raid5 hang which me and only one or two others 
> > ever experienced and which was hard to reproduce?  we were debugging it 
> > well over a year ago (that box has 400+ day uptime now so at least that 
> > long ago :)  the workaround was to increase stripe_cache_size... i seem to 
> > have a way to reproduce something which looks much the same.
> > 
> > setup:
> > 
> > - 2.6.24-rc6
> > - system has 8GiB RAM but no swap
> > - 8x750GB in a raid5 with one spare, chunksize 1024KiB.
> > - mkfs.xfs default options
> > - mount -o noatime
> > - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
> > 
> > that sequence hangs for me within 10 seconds... and i can unhang / rehang 
> > it by toggling between stripe_cache_size 256 and 1024.  i detect the hang 
> > by watching iostat -kx /dev/sd? 5.
> > 
> > i've attached the kernel log where i dumped task and timer state while it 
> > was hung... note that you'll see at some point i did an xfs mount with 
> > external journal but it happens with internal journal as well.
> > 
> > looks like it's using the raid456 module and async api.
> > 
> > anyhow let me know if you need more info / have any suggestions.
> > 
> > -dean
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-29 Thread dean gaudet
On Tue, 25 Dec 2007, Bill Davidsen wrote:

> The issue I'm thinking about is hardware sector size, which on modern drives
> may be larger than 512b and therefore entail a read-alter-rewrite (RAR) cycle
> when writing a 512b block.

i'm not sure any shipping SATA disks have larger than 512B sectors yet... 
do you know of any?  (or is this thread about SCSI which i don't pay 
attention to...)

on a brand new WDC WD7500AAKS-00RBA0 with this partition layout:

255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

so sda1 starts at a non-multiple of 4096 into the disk.
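
(quick arithmetic behind that claim -- with this geometry sda1 begins at
sector 63:

    $ echo $(( 63 * 512 ))
    32256
    $ echo $(( 32256 % 4096 ))
    3584

so the partition starts 3584 bytes past the nearest 4KiB boundary.)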

i ran some random seek+write experiments using
http://arctic.org/~dean/randomio/, here are the results using 512 byte
and 4096 byte writes (fsync after each write), 8 threads, on sda1:

# ./randomio /dev/sda1 8 1 1 512 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  148.5 |    0.0   inf    nan    0.0    nan |  148.5   0.2   53.7   89.3   19.5
  129.2 |    0.0   inf    nan    0.0    nan |  129.2  37.2   61.9   96.7    9.3
  131.2 |    0.0   inf    nan    0.0    nan |  131.2  40.3   61.0   90.4    9.3
  132.0 |    0.0   inf    nan    0.0    nan |  132.0  39.6   60.6   89.3    9.1
  130.7 |    0.0   inf    nan    0.0    nan |  130.7  39.8   61.3   98.1    8.9
  131.4 |    0.0   inf    nan    0.0    nan |  131.4  40.0   60.8  101.0    9.6
# ./randomio /dev/sda1 8 1 1 4096 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  141.7 |    0.0   inf    nan    0.0    nan |  141.7   0.3   56.3   99.3   21.1
  132.4 |    0.0   inf    nan    0.0    nan |  132.4  43.3   60.4   91.8    8.5
  131.6 |    0.0   inf    nan    0.0    nan |  131.6  41.4   60.9  111.0    9.6
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  41.4   60.7   85.3    8.6
  130.6 |    0.0   inf    nan    0.0    nan |  130.6  41.7   61.3   95.0    9.4
  131.4 |    0.0   inf    nan    0.0    nan |  131.4  42.2   60.8   90.5    8.4


i think the anomalous results in the first 10s samples are perhaps the drive
coming out of a standby state.

and here are the results aligned using the sda raw device itself:

# ./randomio /dev/sda 8 1 1 512 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  147.3 |    0.0   inf    nan    0.0    nan |  147.3   0.3   54.1   93.7   20.1
  132.4 |    0.0   inf    nan    0.0    nan |  132.4  37.4   60.6   91.8    9.2
  132.5 |    0.0   inf    nan    0.0    nan |  132.5  37.7   60.3   93.7    9.3
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  39.4   60.7   92.7    9.0
  133.9 |    0.0   inf    nan    0.0    nan |  133.9  41.7   59.8   90.7    8.5
  130.2 |    0.0   inf    nan    0.0    nan |  130.2  40.8   61.5   88.6    8.9
# ./randomio /dev/sda 8 1 1 4096 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  145.4 |    0.0   inf    nan    0.0    nan |  145.4   0.3   54.9   94.0   20.1
  130.3 |    0.0   inf    nan    0.0    nan |  130.3  36.0   61.4   92.7    9.6
  130.6 |    0.0   inf    nan    0.0    nan |  130.6  38.2   61.2   96.7    9.2
  132.1 |    0.0   inf    nan    0.0    nan |  132.1  39.0   60.5   93.5    9.2
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  43.1   60.8   93.8    9.1
  129.0 |    0.0   inf    nan    0.0    nan |  129.0  40.2   62.0   96.4    8.8

it looks pretty much the same to me...

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, Dan Williams wrote:

> On Dec 29, 2007 9:48 AM, dean gaudet <[EMAIL PROTECTED]> wrote:
> > hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
> > the same 64k chunk array and had raised the stripe_cache_size to 1024...
> > and got a hang.  this time i grabbed stripe_cache_active before bumping
> > the size again -- it was only 905 active.  as i recall the bug we were
> > debugging a year+ ago the active was at the size when it would hang.  so
> > this is probably something new.
> 
> I believe I am seeing the same issue and am trying to track down
> whether XFS is doing something unexpected, i.e. I have not been able
> to reproduce the problem with EXT3.  MD tries to increase throughput
> by letting some stripe work build up in batches.  It looks like every
> time your system has hung it has been in the 'inactive_blocked' state
> i.e.  3/4 of stripes active.  This state should automatically
> clear...

cool, glad you can reproduce it :)

i have a bit more data... i'm seeing the same problem on debian's 
2.6.22-3-amd64 kernel, so it's not new in 2.6.24.

i'm doing some more isolation but just grabbing kernels i have precompiled 
so far -- a 2.6.19.7 kernel doesn't show the problem, and early 
indications are a 2.6.21.7 kernel also doesn't have the problem but i'm 
giving it longer to show its head.

i'll try a stock 2.6.22 next depending on how the 2.6.21 test goes, just 
so we get the debian patches out of the way.

i was tempted to blame async api because it's newish :)  but according to 
the dmesg output it doesn't appear the 2.6.22-3-amd64 kernel used async 
API, and it still hung, so async is probably not to blame.

anyhow the test case i'm using is the dma_thrasher script i attached... it 
takes about an hour to give me confidence there's no problems so this will 
take a while.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch] improve stripe_cache_size documentation

2007-12-29 Thread dean gaudet
Document the amount of memory used by the stripe cache and the fact that 
it's tied down and unavailable for other purposes (right?).  thanks to Dan 
Williams for the formula.
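
(worked example, assuming 4KiB pages: the 7-drive raid5 from the hang
thread with stripe_cache_size raised to 1024 pins

    $ echo $(( 4096 * 7 * 1024 ))
    29360128

bytes, i.e. about 28MiB.)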

-dean

Signed-off-by: dean gaudet <[EMAIL PROTECTED]>

Index: linux/Documentation/md.txt
===================================================================
--- linux.orig/Documentation/md.txt 2007-12-29 13:01:25.0 -0800
+++ linux/Documentation/md.txt  2007-12-29 13:04:17.0 -0800
@@ -438,5 +438,11 @@
   stripe_cache_size  (currently raid5 only)
   number of entries in the stripe cache.  This is writable, but
   there are upper and lower limits (32768, 16).  Default is 128.
+
+  The stripe cache memory is locked down and not available for other uses.
+  The total size of the stripe cache is determined by this formula:
+
+PAGE_SIZE * raid_disks * stripe_cache_size
+
   strip_cache_active (currently raid5 only)
   number of active entries in the stripe cache
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, Justin Piszcz wrote:

> Curious btw what kind of filesystem size/raid type (5, but defaults I assume,
> nothing special right? (right-symmetric vs. left-symmetric, etc?)/cache
> size/chunk size(s) are you using/testing with?

mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
mkfs.xfs -f /dev/md2

otherwise defaults

> The script you sent out earlier, you are able to reproduce it easily with 31
> or so kernel tar decompressions?

not sure, the point of the script is to untar more than there is RAM.  it 
happened with a single rsync running though -- 3.5M inodes from a remote 
box.  it also happens with the single 10GB dd write... although i've been 
using the tar method for testing different kernel revs.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet

On Sat, 29 Dec 2007, dean gaudet wrote:

> On Sat, 29 Dec 2007, Justin Piszcz wrote:
> 
> > Curious btw what kind of filesystem size/raid type (5, but defaults I 
> > assume,
> > nothing special right? (right-symmetric vs. left-symmetric, etc?)/cache
> > size/chunk size(s) are you using/testing with?
> 
> mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
> mkfs.xfs -f /dev/md2
> 
> otherwise defaults

hmm i missed a few things, here's exactly how i created the array:

mdadm --create --level=5 --chunk=64 -n7 -x1 --assume-clean /dev/md2 
/dev/sd[a-h]1

it's reassembled automagically each reboot, but i do this each reboot:

mkfs.xfs -f /dev/md2
mount -o noatime /dev/md2 /mnt/new
./dma_thrasher linux.tar.gz /mnt/new

the --assume-clean and noatime probably make no difference though...

on the bisection front it looks like it's new behaviour between 2.6.21.7 
and 2.6.22.15 (stock kernels now, not debian).

i've got to step out for a while, but i'll go at it again later, probably 
with git bisect unless someone has some cherry picked changes to suggest.
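
(sketch of that bisection, using the good/bad kernels established above:

    git bisect start
    git bisect bad v2.6.22
    git bisect good v2.6.21
    # at each step: build, boot, run the dma_thrasher test, then
    git bisect good        # or "git bisect bad" if it hung
)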

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: permit link(2) to work across --bind mounts ?

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, [EMAIL PROTECTED] wrote:

> On Sat, 29 Dec 2007 12:40:47 PST, dean gaudet said:
> 
> > the main worry i have is some user maliciously hardlinks everything
> > under /var/log somewhere else and slowly fills up the file system with
> > old rotated logs.
> 
> "Doctor, it hurts when I do this.."  "Well, don't do that then."

actually it doesn't hurt.  i have other mechanisms which would pick this 
up fairly quickly.

-dean
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RFC: permit link(2) to work across --bind mounts ?

2007-12-28 Thread dean gaudet
On Sat, 29 Dec 2007, Jan Engelhardt wrote:

> 
> On Dec 28 2007 18:53, dean gaudet wrote:
> >p.s. in retrospect i probably could have arranged it more like this:
> >
> >  mount /dev/md1 $tmpmntpoint
> >  mount --bind $tmpmntpoint/var /var
> >  mount --bind $tmpmntpoint/home /home
> >  umount $tmpmntpoint
> >
> >except i can't easily specify that in fstab... and neither of the bind 
> >mounts would show up in df(1).  seems like it wouldn't be hard to support 
> >this type of subtree mount though.  mount(8) could support a single 
> >subtree mount using this technique but the second subtree mount attempt 
> >would fail because you can't temporarily remount the device because the 
> >mount point is gone.
> 
> Why is it gone?
> 
> mount /dev/md1 /tmpmnt
> mount --bind /tmpmnt/var /var
> mount --bind /tmpmnt/home /home
> 
> Is perfectly fine, and /tmpmnt is still alive and mounted. Additionally,
> you can
> 
> umount /tmpmnt
> 
> now, which leaves only /var and /home.

i was trying to come up with a userland-only change in mount(8) which
would behave like so:

# mount --subtree var /dev/md1 /var
  internally mount does:
  - mount /dev/md1 /tmpmnt
  - mount --bind /tmpmnt/var /var
  - umount /tmpmnt

# mount --subtree home /dev/md1 /home
  internally mount does:
  - mount /dev/md1 /tmpmnt
  - mount --bind /tmpmnt/home /home
  - umount /tmpmnt

but that second mount would fail because /dev/md1 is already mounted
(but the mount point is gone)...

it certainly works if i issue the commands individually as i described
-- but a change within mount(8) would have the benefit of working with
/etc/fstab too.

-dean
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RFC: permit link(2) to work across --bind mounts ?

2007-12-28 Thread dean gaudet
On Wed, 19 Dec 2007, David Newall wrote:

> Mark Lord wrote:
> > But.. pity there's no mount flag override for smaller systems,
> > where bind mounts might be more useful with link(2) actually working.
> 
> I don't see it.  You always can make hard link on the underlying filesystem.
> If you need to make it on the bound mount, that is, if you can't locate the
> underlying filesystem to make the hard link, you can use a symbolic link.

i run into it on a system where /home is a bind mount of /var/home ... i 
did this because:

- i prefer /home to be nosuid,nodev (multi-user system)
- i prefer /home to not be on same fs as /
- the system has only one raid1 array, and i can't stand having two 
  writable filesystems competing on the same set of spindles (i like to
  imagine that one fs competing for the spindles can potentially result
  in better seek patterns)
- i didn't want to do /var -> /home/var or vice versa ... because i don't 
  like seeing "/var/home/dean" when i'm in my home dir and such.
- i didn't want to try to balance disk space between /var and /home
- i didn't want to use a volume mgr just to handle disk space balance...

so i gave a bind mount a try.

i was surprised to see that mv(1) between /var and /home causes the file 
to be copied due to the link(1) failing...

it does seem like something which should be configurable per mount 
point... maybe that can be done with the patches i've seen going around 
supporting per-bind mount read-only/etc options?
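
(what the surprise looks like in practice -- paths made up, error text
approximate:

    $ ln /var/tmp/foo /home/dean/foo
    ln: creating hard link `/home/dean/foo': Invalid cross-device link
    $ mv /var/tmp/foo /home/dean/foo    # succeeds, but via copy+unlink
)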

-dean

p.s. in retrospect i probably could have arranged it more like this:

  mount /dev/md1 $tmpmntpoint
  mount --bind $tmpmntpoint/var /var
  mount --bind $tmpmntpoint/home /home
  umount $tmpmntpoint

except i can't easily specify that in fstab... and neither of the bind 
mounts would show up in df(1).  seems like it wouldn't be hard to support 
this type of subtree mount though.  mount(8) could support a single 
subtree mount using this technique but the second subtree mount attempt 
would fail because you can't temporarily remount the device because the 
mount point is gone.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-27 Thread dean gaudet
hmm this seems more serious... i just ran into it with chunksize 64KiB and 
while just untarring a bunch of linux kernels in parallel... increasing 
stripe_cache_size did the trick again.

-dean

On Thu, 27 Dec 2007, dean gaudet wrote:

> hey neil -- remember that raid5 hang which me and only one or two others 
> ever experienced and which was hard to reproduce?  we were debugging it 
> well over a year ago (that box has 400+ day uptime now so at least that 
> long ago :)  the workaround was to increase stripe_cache_size... i seem to 
> have a way to reproduce something which looks much the same.
> 
> setup:
> 
> - 2.6.24-rc6
> - system has 8GiB RAM but no swap
> - 8x750GB in a raid5 with one spare, chunksize 1024KiB.
> - mkfs.xfs default options
> - mount -o noatime
> - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
> 
> that sequence hangs for me within 10 seconds... and i can unhang / rehang 
> it by toggling between stripe_cache_size 256 and 1024.  i detect the hang 
> by watching iostat -kx /dev/sd? 5.
> 
> i've attached the kernel log where i dumped task and timer state while it 
> was hung... note that you'll see at some point i did an xfs mount with 
> external journal but it happens with internal journal as well.
> 
> looks like it's using the raid456 module and async api.
> 
> anyhow let me know if you need more info / have any suggestions.
> 
> -dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-27 Thread dean gaudet
On Thu, 27 Dec 2007, Justin Piszcz wrote:

> With that high of a stripe size the stripe_cache_size needs to be greater than
> the default to handle it.

i'd argue that any deadlock is a bug...

regardless i'm still seeing deadlocks with the default chunk_size of 64k 
and stripe_cache_size of 256... in this case it's with a workload which is 
untarring 34 copies of the linux kernel at the same time.  it's a variant 
of doug ledford's memtest, and i've attached it.

-dean

#!/usr/bin/perl

# Copyright (c) 2007 dean gaudet <[EMAIL PROTECTED]>
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
# OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
# ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
# OTHER DEALINGS IN THE SOFTWARE.

# this idea shamelessly stolen from doug ledford

use warnings;
use strict;

# ensure stdout is not buffered
select(STDOUT); $| = 1;

my $usage = "usage: $0 linux.tar.gz /path1 [/path2 ...]\n";
defined(my $tarball = shift) or die $usage;
-f $tarball or die "$tarball does not exist or is not a file\n";

my @paths = @ARGV;
$#paths >= 0 or die $usage;

# determine size of uncompressed tarball
open(GZIP, "-|") || exec "gzip", "--quiet", "--list", $tarball;
my $line = <GZIP>;
my ($tarball_size) = $line =~ m#^\s*\d+\s*(\d+)#;
defined($tarball_size) or die "unexpected result from gzip --quiet --list $tarball\n";
close(GZIP);

# determine amount of memory
open(MEMINFO, "/proc/meminfo")
  or die "unable to open /proc/meminfo for read: $!\n";
my $total_mem;
while (<MEMINFO>) {
  if (/^MemTotal:\s*(\d+)\s*kB/) {
    $total_mem = $1;
    last;
  }
}
defined($total_mem) or die "did not find MemTotal line in /proc/meminfo\n";
close(MEMINFO);
$total_mem *= 1024;

print "total memory: $total_mem\n";
print "uncompressed tarball: $tarball_size\n";
my $nr_simultaneous = int(1.2 * $total_mem / $tarball_size);
print "nr simultaneous processes: $nr_simultaneous\n";

sub system_or_die {
  my @args = @_;
  system(@args);
  if ($? == -1) {
    my $msg = sprintf("%s failed to exec %s: $!\n", scalar(localtime), $args[0]);
    die $msg;
  }
  elsif ($? & 127) {
    my $msg = sprintf("%s %s died with signal %d, %s coredump\n",
      scalar(localtime), $args[0], ($? & 127), ($? & 128) ? "with" : "without");
    die $msg;
  }
  elsif (($? >> 8) != 0) {
    my $msg = sprintf("%s %s exited with non-zero exit code %d\n",
      scalar(localtime), $args[0], $? >> 8);
    die $msg;
  }
}

sub untar($) {
  mkdir($_[0]) or die localtime()." unable to mkdir($_[0]): $!\n";
  system_or_die("tar", "-xzf", $tarball, "-C", $_[0]);
}

print localtime()." untarring golden copy\n";
my $golden = $paths[0]."/dma_tmp.$$.gold";
untar($golden);

my $pass_no = 0;
while (1) {
  print localtime()." pass $pass_no: extracting\n";
  my @outputs;
  foreach my $n (1..$nr_simultaneous) {
    # treat paths in a round-robin manner
    my $dir = shift(@paths);
    push(@paths, $dir);

    $dir .= "/dma_tmp.$$.$n";
    push(@outputs, $dir);

    my $pid = fork;
    defined($pid) or die localtime()." unable to fork: $!\n";
    if ($pid == 0) {
      untar($dir);
      exit(0);
    }
  }

  # wait for the children
  while (wait != -1) {}

  print localtime()." pass $pass_no: diffing\n";
  foreach my $dir (@outputs) {
    my $pid = fork;
    defined($pid) or die localtime()." unable to fork: $!\n";
    if ($pid == 0) {
      system_or_die("diff", "-U", "3", "-rN", $golden, $dir);
      system_or_die("rm", "-fr", $dir);
      exit(0);
    }
  }

  # wait for the children
  while (wait != -1) {}

  ++$pass_no;
}


Re: external bitmaps.. and more

2007-12-11 Thread dean gaudet
On Thu, 6 Dec 2007, Michael Tokarev wrote:

> I come across a situation where external MD bitmaps
> aren't usable on any standard linux distribution
> unless special (non-trivial) actions are taken.
> 
> First is a small buglet in mdadm, or two.
> 
> It's not possible to specify --bitmap= in assemble
> command line - the option seems to be ignored.  But
> it's honored when specified in config file.

i think neil fixed this at some point -- i ran into it / reported 
essentially the same problems here a while ago.


> The thing is that when a external bitmap is being used
> for an array, and that bitmap resides on another filesystem,
> all common distributions fails to start/mount and to
> shutdown/umount arrays/filesystems properly, because
> all starts/stops is done in one script, and all mounts/umounts
> in another, but for bitmaps to work the two should be intermixed
> with each other.

so i've got a debian unstable box which has uptime 402 days (to give you 
an idea how long ago i last tested the reboot sequence).  it has raid1 
root and raid5 /home.  /home has an external bitmap on the root partition.

i have /etc/default/mdadm set with INITRDSTART to start only the root 
raid1 during initrd... this manages to work out later when the external 
bitmap is required.

but it is fragile... and i think it's only possible to get things to work 
with an initrd and the external bitmap on the root fs or by having custom 
initrd and/or rc.d scripts.
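
(roughly the ordering such scripts have to achieve -- device names and
bitmap path made up for illustration:

    mdadm --assemble /dev/md0                        # raid1 root, internal bitmap
    mount /dev/md0 /                                 # bitmap file now reachable
    mdadm --assemble /dev/md1 --bitmap=/md1-bitmap   # raid5 /home
    mount /dev/md1 /home

or the bitmap= goes in mdadm.conf, given the command-line option bug
mentioned above.)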

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Documentation about unaligned memory access

2007-11-26 Thread dean gaudet
On Fri, 23 Nov 2007, Arne Georg Gleditsch wrote:

> dean gaudet <[EMAIL PROTECTED]> writes:
> > on AMD x86 pre-family 10h the boundary is 8 bytes, and on fam 10h it's 16 
> > bytes.  the penalty is a mere 3 cycles if an access crosses the specified 
> > boundary.
> 
> Worth noting though, is that atomic accesses that cross cache lines on
> an Opteron system is going to lock down the Hypertransport fabric for
> you during the operation -- which is obviously not so nice.

ooh awesome, i hadn't measured that before.

on a 2 node sockF / revF with a random pointer chase running on cpu 0 / 
node 0 i see the avg load-to-load cache miss latency jump from 77ns to 
109ns when i add an unaligned lock-intensive workload on one core of node 
1.  the worst i can get the pointer chase latency to is 273ns when i add 
two threads on node 1 fighting over an unaligned lock.

on a 4 node (square) the worst case i can get seems to be an increase from 
98ns with no antagonist to 385ns with 6 antagonists fighting over an 
unaligned lock on the other 3 nodes.

cool.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Documentation about unaligned memory access

2007-11-22 Thread dean gaudet
On Fri, 23 Nov 2007, Alan Cox wrote:

> Its usually faster if you don't misalign on x86 as well.

i'm not sure if i agree with "usually"... but i know you (alan) are 
probably aware of the exact requirements of the hw.

for everyone else:

on intel x86 processors an access is unaligned only if it crosses a 
cacheline boundary (64 bytes).  otherwise it's aligned.  the penalty for 
crossing a cacheline boundary varies from ~12 cycles (core2) to many 
dozens of cycles (p4).

on AMD x86 pre-family 10h the boundary is 8 bytes, and on fam 10h it's 16 
bytes.  the penalty is a mere 3 cycles if an access crosses the specified 
boundary.

if you're making <= 4 byte accesses i recommend not worrying about 
alignment on x86.  it's pretty hard to beat the hardware support.

i curse all the RISC and embedded processor designers who pretend 
unaligned accesses are something evil and to be avoided.  in case you're 
worried, MIPS patent 4,814,976 expired in december 2006 :)

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][v2] x86, ptrace: support for branch trace store(BTS)

2007-11-20 Thread dean gaudet
On Tue, 20 Nov 2007, dean gaudet wrote:

> On Tue, 20 Nov 2007, Metzger, Markus T wrote:
> 
> > +__cpuinit void ptrace_bts_init_intel(struct cpuinfo_x86 *c)
> > +{
> > +   switch (c->x86) {
> > +   case 0x6:
> > +   switch (c->x86_model) {
> > +#ifdef __i386__
> > +   case 0xD:
> > +   case 0xE: /* Pentium M */
> > +   ptrace_bts_ops = ptrace_bts_ops_pentium_m;
> > +   break;
> > +#endif /* _i386_ */
> > +   case 0xF: /* Core2 */
> > +   ptrace_bts_ops = ptrace_bts_ops_core2;
> > +   break;
> > +   default:
> > +   /* sorry, don't know about them */
> > +   break;
> > +   }
> > +   break;
> > +   case 0xF:
> > +   switch (c->x86_model) {
> > +#ifdef __i386__
> > +   case 0x0:
> > +   case 0x1:
> > +   case 0x2:
> > +   case 0x3: /* Netburst */
> > +   ptrace_bts_ops = ptrace_bts_ops_netburst;
> > +   break;
> > +#endif /* _i386_ */
> > +   default:
> > +   /* sorry, don't know about them */
> > +   break;
> > +   }
> > +   break;
> 
> is this right?  i thought intel family 15 models 3 and 4 supported amd64
> mode...

actually... why aren't you using cpuid level 1 edx bit 21 to 
enable/disable this feature?  isn't that the bit defined to indicate 
whether this feature is supported or not?
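
(fwiw that's the debug store bit, which shows up as "dts" in
/proc/cpuinfo flags:

    grep -wq dts /proc/cpuinfo && echo have DS/BTS
)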

and it seems like this patch and perfmon2 are going to have to live with 
each other... since they both require the use of the DS save area...

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][v2] x86, ptrace: support for branch trace store(BTS)

2007-11-20 Thread dean gaudet
On Tue, 20 Nov 2007, Metzger, Markus T wrote:

> +__cpuinit void ptrace_bts_init_intel(struct cpuinfo_x86 *c)
> +{
> + switch (c->x86) {
> + case 0x6:
> + switch (c->x86_model) {
> +#ifdef __i386__
> + case 0xD:
> + case 0xE: /* Pentium M */
> + ptrace_bts_ops = ptrace_bts_ops_pentium_m;
> + break;
> +#endif /* _i386_ */
> + case 0xF: /* Core2 */
> + ptrace_bts_ops = ptrace_bts_ops_core2;
> + break;
> + default:
> + /* sorry, don't know about them */
> + break;
> + }
> + break;
> + case 0xF:
> + switch (c->x86_model) {
> +#ifdef __i386__
> + case 0x0:
> + case 0x1:
> + case 0x2:
> + case 0x3: /* Netburst */
> + ptrace_bts_ops = ptrace_bts_ops_netburst;
> + break;
> +#endif /* _i386_ */
> + default:
> + /* sorry, don't know about them */
> + break;
> + }
> + break;

is this right?  i thought intel family 15 models 3 and 4 supported amd64
mode...

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv3 0/4] sys_indirect system call

2007-11-20 Thread dean gaudet
On Mon, 19 Nov 2007, Ingo Molnar wrote:

> 
> * Eric Dumazet <[EMAIL PROTECTED]> wrote:
> 
> > I do see a problem, because some readers will take your example as a 
> > reference, as it will probably sit in a page that 
> > google^Wsearch_engines will bring at the top of search results for 
> > next ten years or so.
> > 
> > (I bet for "sys_indirect syscall" -> http://lwn.net/Articles/258708/ )
> > 
> > Next time you post it, please warn users that it will break in some 
> > years, or state clearly this should only be used internally by glibc.
> 
> dont be silly, next time Ulrich should also warn everyone that running 
> attachments and applying patches from untrusted sources is dangerous?
> 
> any code that includes:
> 
>   fd = syscall (__NR_indirect, &r, &i, sizeof (i));
> 
> is by definition broken and unportable in every sense of the word. Apps 
> will use the proper glibc interfaces (if it's exposed).

as an application writer how do i access accept(2) with FD_CLOEXEC 
functionality?  will glibc expose an accept2() with a flags param?  if 
so... why don't we just have an accept2() syscall?

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2 4/4] first use of sys_indirect system call

2007-11-16 Thread dean gaudet
On Fri, 16 Nov 2007, Ulrich Drepper wrote:

> dean gaudet wrote:
> > honestly i think there should be a per-task flag which indicates whether 
> > fds are by default F_CLOEXEC or not.  my reason:  third party libraries.
> 
> Only somebody who thinks exclusively about applications as opposed to
> runtimes/libraries can say something like that.  Library writers don't
> have the luxury of being able to modify any global state.  This has all
> been discussed here before.

only someone who thinks about writing libraries can say something like 
that.  you've solved the problem for yourself, and for well written 
applications, but not for the other 99.% of libraries out there.

i'm not suggesting the library set the global flag.  i'm suggesting that 
me as an app writer will do so.

it seems like both methods are useful.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2 4/4] first use of sys_indirect system call

2007-11-16 Thread dean gaudet
you know... i understand the need for FD_CLOEXEC -- in fact i tried 
petitioning for CLOEXEC options to all the fd creating syscalls something 
like 7 years ago when i was banging my head against the wall trying to 
figure out how to thread apache... but even still i'm not convinced that 
extending every system call which creates an fd is the way to do this.  
honestly i think there should be a per-task flag which indicates whether 
fds are by default F_CLOEXEC or not.  my reason:  third party libraries.

i can control all my own code in a threaded program, but i can't control 
all the code which is linked in.  fds are going to leak.

if i set a per task flag then the only thing which would break are third 
party libraries which use fork/exec and aren't aware they may need to 
unset F_CLOEXEC.  personally i'd rather break that than leak fds to 
another program.

but hey i'm happy to see this sort of thing is finally being fixed, 
thanks.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: perfmon2 merge news

2007-11-16 Thread dean gaudet
On Fri, 16 Nov 2007, Andi Kleen wrote:

> I didn't see a clear list. 

- cross platform extensible API for configuring perf counters
- support for multiplexed counters
- support for virtualized 64-bit counters
- support for PC and call graph sampling at specific intervals
- support for reading counters not necessarily with sampling
- taskswitch support for counters
- API available from userland
- ability to self-monitor: need select/poll/etc interface
- support for PEBS, IBS and whatever other new perf monitoring 
  infrastructure the vendors throw at us in the future
- low overhead:  must minimize the "probe effect" of monitoring
- low noise in measurements:  cannot achieve this in userland

perfmon2 has all of this and more i've probably neglected...
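
(concretely, with perfmon2's pfmon tool -- event names are per-cpu and
the option spelling here is from memory, so treat this as a sketch:

    pfmon -e RETIRED_INSTRUCTIONS,DATA_CACHE_MISSES ./mybench
    # multiple -e groups define event sets, rotated every 10ms:
    pfmon --switch-timeout=10 -e set1,... -e set2,... ./mybench
)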

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread dean gaudet
On Thu, 15 Nov 2007, Paul Mackerras wrote:

> dean gaudet writes:
> 
> > actually multiplexing is the main feature i am in need of. there are an 
> > insufficient number of counters (even on k8 with 4 counters) to do 
> > complete stall accounting or to get a general overview of L1d/L1i/L2 cache 
> > hit rates, average miss latency, time spent in various stalls, and the 
> > memory system utilization (or HT bus utilization).  this runs out to 
> > something like 30 events which are interesting... and re-running a 
> > benchmark over and over just to get around the lack of multiplexing is a 
> > royal pain in the ass.
> 
> So by "multiplexing" do you mean the ability to have multiple event
> sets associated with a context and have the kernel switch between them
> automatically?

yep.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [perfmon] Re: [perfmon2] perfmon2 merge news

2007-11-14 Thread dean gaudet
On Wed, 14 Nov 2007, Andi Kleen wrote:

> Later a syscall might be needed with event multiplexing, but that seems
> more like a far away non essential feature.

actually multiplexing is the main feature i am in need of. there are an 
insufficient number of counters (even on k8 with 4 counters) to do 
complete stall accounting or to get a general overview of L1d/L1i/L2 cache 
hit rates, average miss latency, time spent in various stalls, and the 
memory system utilization (or HT bus utilization).  this runs out to 
something like 30 events which are interesting... and re-running a 
benchmark over and over just to get around the lack of multiplexing is a 
royal pain in the ass.

it's not a "far away non-essential feature" to me.  it's something i would 
use daily if i had all the pieces together now (and i'm constrained 
because i cannot add an out-of-tree patch which adds unofficial syscalls 
to the kernel i use).
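
for the record, here's roughly what i mean by multiplexing, as a C
sketch -- pmc_read()/pmc_program() are invented stand-ins for whatever
interface the kernel would export, not a real API; the point is the
round-robin set switch plus scaling by coverage:

/*
 * rough model of event-set multiplexing: rotate NR_SETS sets of
 * events across 4 hardware counters on a timer tick, and scale each
 * raw count by the fraction of time its set was live.
 */
#include <stdint.h>

#define NR_COUNTERS 4
#define NR_SETS     8   /* 8 sets x 4 counters covers ~30 events */

extern uint64_t pmc_read(int ctr);                /* hypothetical, read-and-clear */
extern void pmc_program(int ctr, uint32_t event); /* hypothetical */

struct event_set {
    uint32_t event[NR_COUNTERS];   /* event select codes */
    uint64_t count[NR_COUNTERS];   /* accumulated raw counts */
    uint64_t ticks;                /* ticks this set was scheduled */
};

static struct event_set sets[NR_SETS];
static int cur;
static uint64_t total_ticks;

void on_timer_tick(void)
{
    struct event_set *s = &sets[cur];
    int i;

    for (i = 0; i < NR_COUNTERS; i++)
        s->count[i] += pmc_read(i);    /* harvest the expiring set */
    s->ticks++;
    total_ticks++;

    cur = (cur + 1) % NR_SETS;         /* round-robin to the next set */
    for (i = 0; i < NR_COUNTERS; i++)
        pmc_program(i, sets[cur].event[i]);
}

/* estimate the full-run total: raw count divided by coverage fraction */
uint64_t scaled_count(const struct event_set *s, int ctr)
{
    return s->ticks ? s->count[ctr] * total_ticks / s->ticks : 0;
}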

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: TCP_DEFER_ACCEPT issues

2007-11-04 Thread dean gaudet
fwiw i also brought the TCP_DEFER_ACCEPT problems up the end of last year:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg28916.html

it's possible the final message in that thread is how we should define the 
behaviour, i haven't tried the TCP_SYNCNT idea though.
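
for reference, this is how the option gets armed today -- a minimal C
sketch with error handling omitted, and the 30 second timeout is
arbitrary; the dispute is over what happens when that timeout expires:

/*
 * TCP_DEFER_ACCEPT on a listener: accept() shouldn't wake us until
 * the client actually sends data (or the timeout lapses).
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <sys/socket.h>

int deferred_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int secs = 30;   /* how long to hold a dataless connection */
    struct sockaddr_in sa = {
        .sin_family = AF_INET,
        .sin_port   = htons(port),   /* sin_addr zeroed = INADDR_ANY */
    };

    setsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT, &secs, sizeof(secs));
    bind(fd, (struct sockaddr *)&sa, sizeof(sa));
    listen(fd, 128);
    return fd;
}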

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: TCP_DEFER_ACCEPT issues

2007-11-04 Thread dean gaudet
fwiw i also brought the TCP_DEFER_ACCEPT problems up the end of last year:

http://www.mail-archive.com/netdev@vger.kernel.org/msg28916.html

it's possible the final message in that thread is how we should define the 
behaviour, i haven't tried the TCP_SYNCNT idea though.

-dean
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Interaction between Xen and XFS: stray RW mappings

2007-10-21 Thread dean gaudet
On Sun, 21 Oct 2007, Jeremy Fitzhardinge wrote:

> dean gaudet wrote:
> > On Mon, 15 Oct 2007, Nick Piggin wrote:
> >
> >   
> >> Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> >> because it generally has to invalidate TLBs on all CPUs.
> >> 
> >
> > why is that?  ignoring 32-bit archs we have heaps of address space 
> > available... couldn't the kernel just burn address space and delay global 
> > TLB invalidate by some relatively long time (say 1 second)?
> >   
> 
> Yes, that's precisely the problem.  xfs does delay the unmap, leaving
> stray mappings, which upsets Xen.

sounds like a bug in xen to me :)

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interaction between Xen and XFS: stray RW mappings

2007-10-21 Thread dean gaudet
On Mon, 15 Oct 2007, Nick Piggin wrote:

> Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
> because it generally has to invalidate TLBs on all CPUs.

why is that?  ignoring 32-bit archs we have heaps of address space 
available... couldn't the kernel just burn address space and delay global 
TLB invalidate by some relatively long time (say 1 second)?
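
something like the following C sketch, with made-up names rather than
the real vmalloc internals -- the cost is exactly the stray-mapping
window this thread is about:

/*
 * "burn address space, flush lazily": vunmap only queues the dead
 * range; the global TLB shootdown happens once per batch (or off a
 * ~1 second timer) instead of once per unmap.
 */
#define PAGE_SHIFT 12
#define LAZY_MAX_PAGES (32UL << 10)    /* flush threshold */

struct lazy_area {
    unsigned long start, end;          /* dead virtual range */
    struct lazy_area *next;
};

static struct lazy_area *lazy_list;
static unsigned long lazy_pages;

extern void flush_tlb_all(void);                      /* illustrative */
extern void free_address_range(struct lazy_area *a);  /* illustrative */

static void purge_lazy_areas(void)
{
    flush_tlb_all();                   /* one global shootdown */
    while (lazy_list) {
        struct lazy_area *a = lazy_list;
        lazy_list = a->next;
        free_address_range(a);         /* address space reusable now */
    }
    lazy_pages = 0;
}

void vunmap_lazy(struct lazy_area *a)
{
    a->next = lazy_list;               /* defer: no IPI, no flush yet */
    lazy_list = a;
    lazy_pages += (a->end - a->start) >> PAGE_SHIFT;

    if (lazy_pages > LAZY_MAX_PAGES)   /* or when the timer fires */
        purge_lazy_areas();
}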

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Bug#447493: zsh missing in /etc/shells

2007-10-21 Thread dean gaudet
Package: zsh
Version: 4.3.4-23

upgrading from 4.3.4-19 to 4.3.4-23 caused zsh to be removed from 
/etc/shells... i have a nightly cron job which looks for users with 
invalid shells and it picked up this change last night after i did the 
aforementioned upgrade yesterday.
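
for anyone wanting the same check, a minimal C sketch using the
standard getpwent(3)/getusershell(3) interfaces:

/*
 * walk the password database and flag any login shell missing from
 * /etc/shells.
 */
#include <pwd.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int shell_ok(const char *shell)
{
    char *s;
    int ok = 0;

    setusershell();
    while (!ok && (s = getusershell()) != NULL)
        ok = (strcmp(s, shell) == 0);
    endusershell();
    return ok;
}

int main(void)
{
    struct passwd *pw;

    while ((pw = getpwent()) != NULL)
        if (pw->pw_shell && *pw->pw_shell && !shell_ok(pw->pw_shell))
            printf("%s: shell %s not in /etc/shells\n",
                   pw->pw_name, pw->pw_shell);
    endpwent();
    return 0;
}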

-dean



-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#447497: pipe viewer does not wrap long lines

2007-10-21 Thread dean gaudet
Package: alpine
Version: 0.+dfsg-1

this is a pine 4.64 -> alpine 0. regression.  when a message with long 
lines is piped through an external command the lines are truncated.  i see 
no options for scrolling the display or avoiding the truncation.  note 
that regular message viewing wraps the lines...

by way of an example i've provided a line hopefully long enough to wrap on 
your display.  try comparing this message unpiped and piped through cat.  
contrast with pine 4.64 -- the piped results are wrapped.

a b c d e f g h 
i j k l m n o p 
q r s t

-dean



-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#446988: Acknowledgement (must compile -fno-strict-aliasing)

2007-10-20 Thread dean gaudet
i rebuilt 0.11.7-1 from source (fetched from snapshot.debian.org) and it 
seems not to be crashing (crashes were occurring in under a day before and 
i've had 0.11.7-1 going for 2 days)... so this really is a 0.11.7-1 -> 
0.11.8-1 regression.  i'm going to upgrade my gcc/etc to latest bleeding 
edge and see if that changes anything.

i'm also going to upgrade an i686 box from .7 to .8 to see if this is 
amd64 specific.

-dean



-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#446988: Acknowledgement (must compile -fno-strict-aliasing)

2007-10-18 Thread dean gaudet
damn... -fno-strict-aliasing isn't enough to fix the crash i started 
seeing in 0.11.8.  i built my own package, but saw a crash within 24h.

-dean



-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#446988: must compile -fno-strict-aliasing

2007-10-17 Thread dean gaudet
Package: libtorrent10
Version: 0.11.8-1

between 0.11.7 and 0.11.8-1 i started getting regular crashes starting 
with:

*** glibc detected *** /usr/bin/rtorrent: double free or corruption
(!prev): 0x0b0952b0 ***

this is on amd64.

i looked at the known issues page and it requires -fno-strict-aliasing, 
but that's not set in the debian/rules.

http://libtorrent.rakshasa.no/wiki/LibTorrentKnownIssues
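
for context, this is the class of bug that flag papers over -- a
contrived C illustration of type-punning, not libtorrent's actual code:

/*
 * with strict aliasing the compiler may assume *i and *f never
 * overlap, so it can reorder or cache the accesses; punning the same
 * storage through both pointer types is undefined behaviour.
 */
#include <stdio.h>

int punned(int *i, float *f)
{
    *i = 1;          /* compiler may assume this can't change *f */
    *f = 0.0f;       /* ...and this can't change *i */
    return *i;       /* legal (under strict aliasing) to return 1 */
}

int main(void)
{
    int x = 0;
    /* same storage viewed as int and float */
    printf("%d\n", punned(&x, (float *)&x));
    return 0;
}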

-dean



-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#444364: please stop rewriting all the initrds

2007-09-28 Thread dean gaudet
On Fri, 28 Sep 2007, martin f krafft wrote:

> also sprach dean gaudet [EMAIL PROTECTED] [2007.09.28.0230 +0100]:
> > it is EXCEPTIONALLY DANGEROUS to replace EVERY SINGLE initrd when mdadm is 
> > installed/upgraded.
> 
> Please STOP SCREAMING and look at the existing bugs before you reply
> new ones. 2.6.3-1 will not do this anymore. You could help testing:

i did search but obviously didn't search for the right things, alas.

> > but no this time i'll have to resort to a recovery CD.
> 
> There are backups of the initrds. Plus, I tend to make sure your
> initrd will not get corrupted.

unfortunately it happened on a box where i upgrade on unstable frequently 
but reboot infrequently... so the .bak had already been overwritten.  (in 
the end it was my own configuration problem which resulted in the initrd 
being unbootable).

i think i might make some @reboot cron job which saves away a copy of 
/boot/initrd-`uname -r` after a successful boot, so i always have 
something to fall back on.

thanks
-dean



-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#444364: please stop rewriting all the initrds

2007-09-27 Thread dean gaudet
Package: mdadm
Version: 2.6.2-2

it is EXCEPTIONALLY DANGEROUS to replace EVERY SINGLE initrd when mdadm is 
installed/upgraded.

you pretty much guarantee that any problem will produce an unbootable 
system -- especially if root is on md.

as has just occured to me.

in the past in this situation i could easily go back to an old kernel 
version which could still boot my system fine *because its initrd hadn't 
been broken as well*.

but no this time i'll have to resort to a recovery CD.

-dean



-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Re: Intel Memory Ordering White Paper

2007-09-08 Thread dean gaudet
On Sat, 8 Sep 2007, Petr Vandrovec wrote:

> dean gaudet wrote:
> > On Sun, 9 Sep 2007, Nick Piggin wrote:
> > 
> > > I've also heard that string operations do not follow the normal ordering,
> > > but
> > > that's just with respect to individual loads/stores in the one operation,
> > > I
> > > hope? And they will still follow ordering rules WRT surrounding loads and
> > > stores?
> > 
> > see section 7.2.3 of intel volume 3A...
> > 
> > "Code dependent upon sequential store ordering should not use the string
> > operations for the entire data structure to be stored. Data and semaphores
> > should be separated. Order dependent code should use a discrete semaphore
> > uniquely stored to after any string operations to allow correctly ordered
> > data to be seen by all processors."
> > 
> > i think we need sfence after things like copy_page, clear_page, and possibly
> > copy_user... at least on intel processors with fast strings option enabled.
> 
> I do not think.  I believe that authors are trying to say that
> 
> struct { uint8 lock; uint8 data; } x;
> 
> lea (x.data),%edi
> mov $2,%ecx
> std
> rep movsb
> 
> to set both data and lock does not guarantee that x.lock will be set after
> x.data and that you should do
> 
> lea (x.data),%edi
> std
> movsb
> movsb  # or mov (%esi),%al; mov %al,(%edi), but movsb looks discrete enough to
> me
> 
> instead (and yes, I know that my example is silly).

no it's worse than that -- intel fast string stores can become globally 
visible in any order at all w.r.t. normal loads or stores... so take all 
those great examples in their recent whitepaper and throw out all the 
ordering guarantees for addresses on different cachelines if any of the 
stores are rep string.

for example transitive store ordering for locations on multiple cachelines 
is not guaranteed at all.  the kernel could return a zero page and one 
core could see the zeroes out of order with another core performing some 
sort of lockless data structure operation.

fast strings don't break ordering from the point of view of the core 
performing the rep string operation, but externally there are no 
guarantees (it's right there in the docs).

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Intel Memory Ordering White Paper

2007-09-08 Thread dean gaudet
On Sun, 9 Sep 2007, Nick Piggin wrote:

> I've also heard that string operations do not follow the normal ordering, but
> that's just with respect to individual loads/stores in the one operation, I
> hope? And they will still follow ordering rules WRT surrounding loads and
> stores?

see section 7.2.3 of intel volume 3A...

"Code dependent upon sequential store ordering should not use the string 
operations for the entire data structure to be stored. Data and semaphores 
should be separated. Order dependent code should use a discrete semaphore 
uniquely stored to after any string operations to allow correctly ordered 
data to be seen by all processors."

i think we need sfence after things like copy_page, clear_page, and 
possibly copy_user... at least on intel processors with fast strings 
option enabled.
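
something along these lines (a gcc/x86-64 sketch of the pattern the
manual describes, not the kernel's actual copy_page()/clear_page()):

/*
 * "a discrete semaphore uniquely stored to after any string
 * operations": fence the (possibly rep-movs) copy before publishing
 * the ready flag, so no other core can see the flag before the data.
 */
#include <string.h>

struct shared {
    char data[4096];
    volatile int ready;
};

void publish(struct shared *s, const char *src)
{
    memcpy(s->data, src, sizeof(s->data)); /* may use fast strings */
    asm volatile("sfence" ::: "memory");   /* drain weakly-ordered stores */
    s->ready = 1;                          /* the discrete semaphore */
}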

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFT][PATCH v7] sata_mv: convert to new EH

2007-09-06 Thread dean gaudet
On Fri, 13 Jul 2007, greg wrote:

> dean gaudet dean at arctic.org writes:
> > if you've got any other workload you'd like me to throw at it, 
> > let me know.  
> 
> I've had a few problems with the driver in 2.6.20 (fc6xen x86_64). The 
> machine tended to lock up after a random period of time (from a few 
> minutes upwards), without any messages. Performing a smartctl on all 
> the disks, or leaving smartd running, seemed to speed up the rate at 
> which the crash occurred. What I found was that by moving the sata_mv 
> device onto its own bus (or a bus with two sata_mv devices), the 
> crashes went away. Are you doing tests with the controller sharing a 
> bus with other devices?
> 
> Is there any merit to my observation that it might be an issue with 
> devices sharing a PCI-X bus?
> 
> Cards: Supermicro 5081 (SAT-MV8), Supermicro 6081 (SAT2-MV8), Highpoint 
> 5081 (RocketRaid 1820A v1.1). Motherboards: Tyan S2882, AMD 8131 
> chipset; IBM x206, Intel 6300ESB.

hmm!  i don't seem to have replied to this.

you know, i've seen this problem.  the first time it happened was with a 
promise ultra tx/100 or tx/133 (on a dual k7 box, two controllers on the 
same bus certainly)... a 5 minute cronjob logging HD temperatures via 
smart would occasionally cause one of the disks to just disappear, return 
errors on every request, and required a reboot to rediscover it.  
eliminating the cronjob stopped the problem.

i switched to 3ware 750x and the problem went away even with the cronjob 
going.

forward a few years and i ran into the same problem with a 3ware 9550sx 
(only card on the bus) -- and a firmware upgrade to the controller 
eventually fixed the problem.

but yeah, i've been meaning to add a smartctl -a once every 10 seconds 
to my burn-in process because of these experiences... but haven't built a 
new server in a while.

the particular box i was testing sata_mv on (tyan s2881) has every pci-x 
slot filled with one thing or another, but i only have one sata_mv device.  
if i get around to testing again i'll throw smartctl into the mix.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-ide in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 5/5] x86: Set PCI config space size to extended for AMD Barcelona

2007-09-03 Thread dean gaudet
it's so very unfortunate the PCI standard has no feature bit to indicate 
the presence of ECS.

FWIW in my testing on a range of machines spanning 7 or 8 years i could 
read config space reg 256... and get 0x when the device didn't 
support ECS, and get valid data when the device did support ECS... granted 
there may be some system out there which behaves really badly when you do 
this.

perhaps someone could write a userspace program and test that concept on a 
far wider range of machines.
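
a minimal sketch of such a probe via sysfs -- the device path is just
an example; a real survey would walk /sys/bus/pci/devices/:

/*
 * pread() offset 256 of one device's config space.  all-ones
 * suggests no ECS data; anything else means something answered up
 * there.  note the kernel truncates the file at dev->cfg_size, so a
 * short read already tells you the kernel thinks nothing is past 256.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:00:18.0/config";
    uint32_t reg;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return perror(path), 1;
    if (pread(fd, &reg, sizeof(reg), 256) != (ssize_t)sizeof(reg))
        printf("%s: config space capped at 256 bytes\n", path);
    else
        printf("%s: reg 256 = 0x%08x (%s)\n", path, reg,
               reg == 0xffffffff ? "probably no ECS" : "ECS responds");
    close(fd);
    return 0;
}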

-dean

On Mon, 3 Sep 2007, Robert Richter wrote:

> This patch sets the config space size for AMD Barcelona PCI devices to
> 4096.
> 
> Signed-off-by: Robert Richter <[EMAIL PROTECTED]>
> 
> ---
>  arch/i386/pci/fixup.c |   14 ++
>  1 file changed, 14 insertions(+)
> 
> Index: linux-2.6/arch/i386/pci/fixup.c
> ===
> --- linux-2.6.orig/arch/i386/pci/fixup.c
> +++ linux-2.6/arch/i386/pci/fixup.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include "pci.h"
>  
> +#define PCI_CFG_SPACE_EXP_SIZE   4096
>  
>  static void __devinit pci_fixup_i450nx(struct pci_dev *d)
>  {
> @@ -444,3 +445,16 @@ static void __devinit pci_siemens_interr
>  }
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_SIEMENS, 0x0015, pci_siemens_interrupt_controller);
> +
> +/*
> + * Extend size of PCI configuration space for AMD CPUs
> + */
> +static void __devinit pci_ext_cfg_space_access(struct pci_dev *dev)
> +{
> + dev->cfg_size = PCI_CFG_SPACE_EXP_SIZE;
> +}
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_FAM10H_HT,   pci_ext_cfg_space_access);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_FAM10H_MAP,  pci_ext_cfg_space_access);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_FAM10H_DRAM, pci_ext_cfg_space_access);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_FAM10H_MISC, pci_ext_cfg_space_access);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_FAM10H_LINK, pci_ext_cfg_space_access);
> 
> -- 
> AMD Saxony, Dresden, Germany
> Operating System Research Center
> email: [EMAIL PROTECTED]
> 
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

