Re: md autodetect only detects one disk in raid1

2007-01-27 Thread dean gaudet
take a look at your mdadm.conf ... both on your root fs and in your 
initrd... look for a DEVICE line and make sure it says "DEVICE 
partitions"... anything else is likely to cause problems like below.

also make sure each array is specified by UUID rather than device.

and then rebuild your initrd.  (dpkg-reconfigure linux-image-`uname -r` on 
debuntu).
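
for illustration, a minimal mdadm.conf along those lines might look like 
this (the UUID is invented -- take your real ARRAY lines from the output 
of mdadm --detail --scan):

```
# scan all partitions for md superblocks instead of a hardcoded device list
DEVICE partitions

# identify each array by UUID, never by member device name
ARRAY /dev/md0 UUID=01234567:89abcdef:01234567:89abcdef
```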

that "something else in the system claims use of the device" problem makes 
me guess you're on ubuntu pre-edgy... where for whatever reason they 
included evms in the default install and for whatever inane reason evms 
steals every damn device in the system when it starts up.  
uninstall/deactivate evms if you're not using it.

-dean

On Sat, 27 Jan 2007, kenneth johansson wrote:

 I run raid1 on my root partition /dev/md0. Now I had a bad disk so I had
 to replace it, but did not notice until I got home that I had got a SATA
 instead of a PATA. Since I had a free SATA interface I just put it in
 that. I had no problem adding the disk to the raid1 device, that is until
 I rebooted the computer. 
 
 both the PATA disk and the SATA disk are detected before md starts up the
 raid, but only the PATA disk is activated. So the raid device is always
 booting in degraded mode. Since this is the root disk I use the
 autodetect feature with partition type fd.
 
 Also, something else in the system claims use of the device, since I can
 not add the SATA disk after the system has done a complete boot. I guess
 it has something to do with device mapper and LVM, which I also run on the
 data disks, but I'm not sure. Any tip on what it can be?
 
 If I add the SATA disk to md0 early enough in the boot it works, but why
 is it not autodetected?
 
 
 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [rdiff-backup-users] More patches to get rdiff-backup working under cygwin/windows

2007-01-27 Thread dean gaudet
On Fri, 26 Jan 2007, Marc Dyksterhouse wrote:

   http://www.visiwave.com/download/rdiff_backup/rpath.py.patch

can you provide more information on why this is necessary?  i'm assuming 
it's because cygwin/windows can't do an fsync in some situation...

would it be possible to put another try/except around the os.fsync to 
catch that case instead of just disabling the fsync entirely?  i don't 
think i want to commit this patch as is... unless that fsync really isn't 
necessary.


   http://www.visiwave.com/download/rdiff_backup/Security.py.patch

committed to cvs HEAD


   http://www.visiwave.com/download/rdiff_backup/FilenameMapping.py.patch

committed to cvs HEAD

   http://www.visiwave.com/download/rdiff_backup/fs_abilities.py.patch

hmm i'm committing this anyhow because i didn't notice the previous patch 
depends on it... next time send them in order please :)

but -- can you expand on this chunk:

-   else: return "^a-z0-9_ -." # quote everything but basic chars
+   else: return "^a-z0-9_ .-" # quote everything but basic chars
 
if self.dest_fsa.extended_filenames:
return "" # Don't quote anything
-   else: return "^A-Za-z0-9_ -."
+   else: return "^A-Za-z0-9_ .-"

" -." is a valid range... so this change will start escaping those things
except for space dot dash... this was intentional?
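
to spell out the range point (dean's chunk is a Python character class, 
but POSIX bracket expressions follow the same rule; this grep demo is 
illustrative only):

```shell
# '[ -.]' is a range from space (0x20) to '.' (0x2e): it matches '!', '#',
# ',' and friends.  '[ .-]' lists space, dot and dash literally ('-' placed
# last in a bracket expression is literal).
printf 'a!b\n' | grep -q '[ -.]' && echo "range form matches '!'"
printf 'a!b\n' | grep -q '[ .-]' || echo "literal form does not match '!'"
```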

oh and... i tested none of this.  i encourage folks to grab the cvs head 
and report back if it's broken or not.

-dean


___
rdiff-backup-users mailing list at rdiff-backup-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/rdiff-backup-users
Wiki URL: http://rdiff-backup.solutionsfirst.com.au/index.php/RdiffBackupWiki


Re: [rdiff-backup-users] Re: How to Escape Globbing Patterns?

2007-01-27 Thread dean gaudet


On Mon, 22 Jan 2007, Dave Howorth wrote:

 Andrew Price wrote:
  On 12/01/07 11:57, Andrew Price wrote:
  I'm using --include-globbing-filelist. If I wanted to specify a file in
  a file list that has [] in the file name, e.g. myfile[foo].txt, how
  would I escape the square brackets so that they aren't treated as a
  globbing pattern?
  
  I apologise for bumping this thread. Does somebody have an answer to
  the above question? Even if it's "it can't be done, you should work
  around it", it would be helpful to me. I'm just trying to close a bug in
  a project that relates to it. Let me know if the question is worded badly.
 
 I'm new to rdiff-backup and I don't know python, so I'm not sure I'll be
 much help, but I guess any answer is better than none. The code that
 implements this appears to be in selection.py and it says:
 
   def glob_to_re(self, pat):
 Returned regular expression equivalent to shell glob pat
 
 Currently only the ?, *, [], and ** expressions are supported.
 Ranges like [a-z] are also currently unsupported.  There is no
 way to quote these special characters.
 
 So I guess the answer is "it can't be done", or rather, "you'll need to
 hack this code".

patches welcome.  include a man page patch as well.

-dean




Re: why would EPIPE cause socket port to change?

2007-01-23 Thread dean gaudet
On Tue, 23 Jan 2007, Rick Jones wrote:

 Herbert Xu wrote:
  Prior to the last write, the socket entered the CLOSED state meaning
  that the old port is no longer allocated to it.  As a result, the
  last write operates on an unconnected socket which causes a new local
  port to be allocated as an autobind.  It then fails because the socket
  is still not connected.
  
  So any attempt to run getsockname after an error on the socket is
  simply buggy.
 
 But falls within the principle of least surprise doesn't it?  Unless the
 application has called close() or bind(), it does seem like a reasonable
 expectation that the port assignments are not changed.

i sampled a few other OSes...

netbsd returns EINVAL after close
freebsd returns ECONNRESET after close
OSX retains the same port number
solaris 10 returns port 0

actually any of those behaviours seems more appropriate than randomly 
assigning a new port :)  but i like the ENOTCONN suggestion from Michael 
Tokarev the best... it matches the ENOTCONN from getpeername.


  (fwiw this is one of two reasons i've found for libnss-ldap to leak
  sockets... causing nscd to crash.)
 
 Of course, that seems rather odd too - why does libnss-ldap check the socket
 name on a socket after an EPIPE anyway?

libnss-ldap has some code which attempts to determine if its private 
socket has been trampled on in between calls to the library... and to do 
this it caches getsockname/getpeername results and compares them every 
time the library is re-entered... and when there's a mismatch it leaks a 
socket (eventually crashing nscd if you're using that).  i've been trying 
to band-aid over the problem:

http://bugzilla.padl.com/show_bug.cgi?id=304
http://bugzilla.padl.com/show_bug.cgi?id=305

but i'm probably going to need to approach it from another direction -- 
make libnss-ldap monitor the ldap library results so it knows when there's 
been a read/write error so that it stops doing this 
getsockname/getpeername thing after the error has occurred.

-dean
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why would EPIPE cause socket port to change?

2007-01-23 Thread dean gaudet
On Tue, 23 Jan 2007, David Miller wrote:

 From: dean gaudet [EMAIL PROTECTED]
 Date: Tue, 23 Jan 2007 12:11:01 -0800 (PST)
 
  libnss-ldap has some code which attempts to determine if its private 
  socket has been trampled on in between calls to the library... and to do 
  this it caches getsockname/getpeername results and compares them every 
  time the library is re-entered... and when there's a mismatch it leaks a 
  socket (eventually crashing nscd if you're using that).  i've been trying 
  to band-aid over the problem:
  
  http://bugzilla.padl.com/show_bug.cgi?id=304
  http://bugzilla.padl.com/show_bug.cgi?id=305
  
  but i'm probably going to need to approach it from another direction -- 
  make libnss-ldap monitor the ldap library results so it knows when there's 
  been a read/write error so that it stops doing this 
  getsockname/getpeername thing after the error has occurred.
 
 Please do not write programs in this way.  getsockname/getpeername
 were never meant to be used in that way, and it's hella inefficient
 to keep checking the socket like that to boot.
 
 I really don't see you gaining anything by making this check every
 time the user calls into the library.
 
 If the application mucks with the communications channel socket, so
 what, it's his application that will go tits up.
 
 Is there some tricky interaction between nscd and something like
 libnss-ldap that makes this tom-foolery necessary?
 

oh heck yeah i totally agree -- it's not my code though, i'm just 
debugging it.

-dean


Re: [patch] faster vgetcpu using sidt (take 2)

2007-01-22 Thread dean gaudet
On Thu, 18 Jan 2007, Andi Kleen wrote:

> > let me know what you think... thanks.
> 
> It's ok, although I would like to have the file in a separate directory.

cool -- do you have a directory in mind?

and would you like this change as two separate patches or one combined 
patch?

thanks
-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


why would EPIPE cause socket port to change?

2007-01-22 Thread dean gaudet
in the test program below the getsockname result on a TCP socket changes 
across a write which produces EPIPE... here's a fragment of the strace:

getsockname(3, {sa_family=AF_INET, sin_port=htons(37636), 
sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
...
write(3, "hi!\n", 4) = 4
write(3, "hi!\n", 4) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
getsockname(3, {sa_family=AF_INET, sin_port=htons(59882), 
sin_addr=inet_addr("127.0.0.1")}, [16]) = 0

why does the port# change?  this is on 2.6.19.1.

(fwiw this is one of two reasons i've found for libnss-ldap to leak 
sockets... causing nscd to crash.)

-dean

reproduce like:

make test-sockname-change
nc -l -p  -c 'exit 0' &
strace ./test-sockname-change 127.0.0.1 

--- snip ---

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/uio.h>
#include <errno.h>
#include <signal.h>
#include <fcntl.h>

#ifndef INADDR_NONE
#define INADDR_NONE (-1ul)
#endif

int main(int argc, char **argv)
{
  struct sockaddr_in server_addr;
  struct sockaddr_in before, after;
  socklen_t slen;
  int s;
  struct iovec vector[3];
  char buf[100];
  int i;
  const int just_say_no = 1;

  if (argc != 3) {
usage:
    fprintf(stderr, "usage: test-sigpipe a.b.c.d port#\n");
    exit(1);
  }
  server_addr.sin_family = AF_INET;
  server_addr.sin_addr.s_addr = inet_addr(argv[1]);
  if (server_addr.sin_addr.s_addr == INADDR_NONE) {
    fprintf(stderr, "bogus address\n");
    goto usage;
  }
  server_addr.sin_port = htons(atoi(argv[2]));

  s = socket(AF_INET, SOCK_STREAM, 0);
  if (s < 0) {
    perror("socket");
    exit(1);
  }
  if (connect(s, (struct sockaddr *)&server_addr, sizeof(server_addr)) != 0) {
    perror("connect");
    exit(1);
  }

  if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, (char*)&just_say_no, 
      sizeof(just_say_no)) != 0) {
    perror("TCP_NODELAY");
    exit(1);
  }

  fcntl(s, F_SETFL, fcntl(s, F_GETFL) | O_NONBLOCK);

  slen = sizeof(before);
  if (getsockname(s, (struct sockaddr *)&before, &slen)) {
    perror("getsockname before");
  }

  signal(SIGPIPE, SIG_IGN);

  sleep(1);

  do {
    i = write(s, "hi!\n", 4);
  } while (i >= 0);
  if (errno != EPIPE) {
    fprintf(stderr, "was expecting EPIPE from write\n");
    exit(1);
  }

  slen = sizeof(after);
  if (getsockname(s, (struct sockaddr *)&after, &slen)) {
    perror("getsockname after");
  }

  printf("before = %d, after = %d\n", ntohs(before.sin_port), 
      ntohs(after.sin_port));

  return 0;
}




Re: bad performance on RAID 5

2007-01-18 Thread dean gaudet
On Wed, 17 Jan 2007, Sevrin Robstad wrote:

 I'm suffering from bad performance on my RAID5.
 
 an "echo check > /sys/block/md0/md/sync_action"
 
 gives a speed at only about 5000K/sec , and HIGH load average :
 
 # uptime
 20:03:55 up 8 days, 19:55,  1 user,  load average: 11.70, 4.04, 1.52

iostat -kx /dev/sd? 10  ... and sum up the total IO... 

also try increasing sync_speed_min/max

and a loadavg jump like that suggests to me you have other things 
competing for the disk at the same time as the check.
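
as a sketch, summing the total IO from one iostat -kx sample could look 
like this (the column positions assume rkB/s and wkB/s are the 6th and 
7th fields -- check your iostat's header line; the sample figures are 
invented):

```shell
# one fabricated iostat -kx sample; on a live box pipe real iostat output
# in instead of this here-document.
cat <<'EOF' | awk 'NR > 1 { r += $6; w += $7 }
    END { printf "total: %.0f rkB/s, %.0f wkB/s\n", r, w }'
Device: rrqm/s wrqm/s r/s  w/s  rkB/s   wkB/s   %util
sda     0.00   1.20   5.00 3.00 2500.00 1500.00 42.00
sdb     0.00   0.80   4.00 2.00 2000.00 1000.00 31.00
EOF
```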

-dean


Re: raid5 software vs hardware: parity calculations?

2007-01-15 Thread dean gaudet
On Mon, 15 Jan 2007, Robin Bowes wrote:

 I'm running RAID6 instead of RAID5+1 - I've had a couple of instances
 where a drive has failed in a RAID5+1 array and a second has failed
 during the rebuild after the hot-spare had kicked in.

if the failures were read errors without losing the entire disk (the 
typical case) then new kernels are much better -- on read error md will 
reconstruct the sectors from the other disks and attempt to write them back.

you can also run monthly checks...

echo check > /sys/block/mdX/md/sync_action

it'll read the entire array (parity included) and correct read errors as 
they're discovered.

-dean


Re: raid5 software vs hardware: parity calculations?

2007-01-15 Thread dean gaudet
On Mon, 15 Jan 2007, berk walker wrote:

 dean gaudet wrote:
  echo check > /sys/block/mdX/md/sync_action
  
  it'll read the entire array (parity included) and correct read errors as
  they're discovered.

 
 Could I get a pointer as to how I can do this check in my FC5 [BLAG] system?
 I can find no appropriate check, nor md available to me.  It would be a
 good thing if I were able to find potentially weak spots, rewrite them to
 good, and know that it might be time for a new drive.
 
 All of my arrays have drives of approx the same mfg date, so the possibility
 of more than one showing bad at the same time can not be ignored.

it should just be:

echo check > /sys/block/mdX/md/sync_action

if you don't have a /sys/block/mdX/md/sync_action file then your kernel is 
too old... or you don't have /sys mounted... (or you didn't replace X with 
the raid number :)

iirc there were kernel versions which had the sync_action file but didn't 
yet support the check action (i think possibly even as recent as 2.6.17 
had a small bug initiating one of the sync_actions but i forget which 
one).  if you can upgrade to 2.6.18.x it should work.

debian unstable (and i presume etch) will do this for all your arrays 
automatically once a month.
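
a runnable sketch of the whole cycle (MD defaults to a throwaway demo 
directory here so the sketch runs anywhere; on a real array run it as 
root with MD=/sys/block/mdX/md, and note mismatch_cnt is only meaningful 
once the check has finished):

```shell
#!/bin/sh
MD="${MD:-./demo-md}"
if [ "$MD" = "./demo-md" ]; then
    # fabricate stand-in sysfs files so the demo works without an array
    mkdir -p "$MD"
    echo idle > "$MD/sync_action"
    echo 0 > "$MD/mismatch_cnt"
fi
echo check > "$MD/sync_action"    # request a full redundancy check
echo "sync_action=$(cat "$MD/sync_action")"
echo "mismatch_cnt=$(cat "$MD/mismatch_cnt")"
```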

-dean


Re: raid5 software vs hardware: parity calculations?

2007-01-15 Thread dean gaudet
On Mon, 15 Jan 2007, Mr. James W. Laferriere wrote:

   Hello Dean ,
 
 On Mon, 15 Jan 2007, dean gaudet wrote:
 ...snip...
  it should just be:
  
  echo check > /sys/block/mdX/md/sync_action
  
  if you don't have a /sys/block/mdX/md/sync_action file then your kernel is
  too old... or you don't have /sys mounted... (or you didn't replace X with
  the raid number :)
  
  iirc there were kernel versions which had the sync_action file but didn't
  yet support the check action (i think possibly even as recent as 2.6.17
  had a small bug initiating one of the sync_actions but i forget which
  one).  if you can upgrade to 2.6.18.x it should work.
  
  debian unstable (and i presume etch) will do this for all your arrays
  automatically once a month.
  
  -dean
 
   Being able to run a 'check' is a good thing (tm).  But without a
 method to acquire status & data back from the check, it seems rather bland.
 Is there a tool/file to poll where data & status can be acquired?

i'm not 100% certain what you mean, but i generally just monitor dmesg for 
the md read error message (mind you the message pre-2.6.19 or .20 isn't 
very informative but it's obvious enough).

there is also a file mismatch_cnt in the same directory as sync_action ... 
the Documentation/md.txt (in 2.6.18) refers to it incorrectly as 
mismatch_count... but anyhow why don't i just repaste the relevant portion 
of md.txt.

-dean

...

Active md devices for levels that support data redundancy (1,4,5,6)
also have

   sync_action
 a text file that can be used to monitor and control the rebuild
 process.  It contains one word which can be one of:
   resync- redundancy is being recalculated after unclean
   shutdown or creation
   recover   - a hot spare is being built to replace a
   failed/missing device
   idle  - nothing is happening
   check - A full check of redundancy was requested and is
   happening.  This reads all blocks and checks
   them. A repair may also happen for some raid
   levels.
   repair- A full check and repair is happening.  This is
   similar to 'resync', but was requested by the
   user, and the write-intent bitmap is NOT used to
   optimise the process.

  This file is writable, and each of the strings that could be
  read are meaningful for writing.

   'idle' will stop an active resync/recovery etc.  There is no
   guarantee that another resync/recovery may not be automatically
   started again, though some event will be needed to trigger
   this.
   'resync' or 'recovery' can be used to restart the
   corresponding operation if it was stopped with 'idle'.
   'check' and 'repair' will start the appropriate process
   providing the current state is 'idle'.

   mismatch_count
  When performing 'check' and 'repair', and possibly when
  performing 'resync', md will count the number of errors that are
  found.  The count in 'mismatch_cnt' is the number of sectors
  that were re-written, or (for 'check') would have been
  re-written.  As most raid levels work in units of pages rather
  than sectors, this may be larger than the number of actual errors
  by a factor of the number of sectors in a page.
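
a quick sketch of polling that file across arrays (BASE and the count of 
128 are made up for the demo; point BASE at /sys on a real system):

```shell
#!/bin/sh
BASE="${BASE:-./demo-sys}"
if [ "$BASE" = "./demo-sys" ]; then
    # fabricate one stand-in array with an invented mismatch count
    mkdir -p "$BASE/block/md0/md"
    echo 128 > "$BASE/block/md0/md/mismatch_cnt"
fi
# report mismatch_cnt for every array found under BASE
for f in "$BASE"/block/md*/md/mismatch_cnt; do
    [ -r "$f" ] && echo "${f}: $(cat "$f")"
done
```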



Bug#406949: reduce cron job noise

2007-01-15 Thread dean gaudet
Package: htdig
Version: 1:3.2.0b6-3

i'd rather not get this message every day from cron:

/etc/cron.daily/htdig:
/etc/cron.daily/htdig: line 22:  1723 Terminated  lockfile-touch 
/var/run/htdig.cron

the patch below should quiet things.

-dean

--- etc/cron.daily/htdig.dpkg-orig  2006-10-01 09:40:22.0 -0700
+++ etc/cron.daily/htdig2007-01-15 00:29:51.0 -0800
@@ -18,5 +18,5 @@
fi
 fi
 
-kill ${BADGER}
+kill ${BADGER} >/dev/null 2>&1
 lockfile-remove /var/run/htdig.cron


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]





Re: [patch] faster vgetcpu using sidt (take 2)

2007-01-14 Thread dean gaudet
On Sat, 13 Jan 2007, dean gaudet wrote:

> ok here is the latest rev of this patch (against 2.6.20-rc4).
> 
> timings in cycles:
> 
>                 baseline   patched    baseline   patched
>                 no cache   no cache   cache      cache
> k8 pre-revF        21         16         14         17
> k8 revF            31         17         14         17
> core2              38         16         12         14
> p4                 49         41         24         24
> 
> the degradation in cached timings appears to be due to the 16 byte stack
> frame set up for the sidt instruction.  apparently due to -mno-red-zone...
> would you accept a patch which re-enables the red-zone for vsyscalls?

here is a first stab at a patch (applied on top of my vgetcpu sidt patch) 
which enables red-zone for vsyscall.  it fixes the cache degredation 
problem above by getting rid of the stack frame setup in vgetcpu (and 
improves the no cache cases as well but i haven't run it everywhere yet).

to do this i split the user-mode-only portion of vsyscall.c into 
vsyscall_user.c.  this required a couple externs in vsyscall.c and two 
extra ".globl" in the asm in vsyscall_user.c.

i'm not sure if we need the CFLAGS_vsyscall.o still or not.

let me know what you think... thanks.

-dean

Index: linux/arch/x86_64/kernel/Makefile
===
--- linux.orig/arch/x86_64/kernel/Makefile  2006-11-29 13:57:37.0 
-0800
+++ linux/arch/x86_64/kernel/Makefile   2007-01-13 23:34:22.0 -0800
@@ -6,7 +6,7 @@
 EXTRA_AFLAGS   := -traditional
 obj-y  := process.o signal.o entry.o traps.o irq.o \
ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_x86_64.o \
-   x8664_ksyms.o i387.o syscall.o vsyscall.o \
+   x8664_ksyms.o i387.o syscall.o vsyscall.o vsyscall_user.o \
setup64.o bootflag.o e820.o reboot.o quirks.o i8237.o \
pci-dma.o pci-nommu.o alternative.o
 
@@ -45,6 +45,7 @@
 obj-y  += intel_cacheinfo.o
 
 CFLAGS_vsyscall.o  := $(PROFILING) -g0
+CFLAGS_vsyscall_user.o := $(PROFILING) -g0 -mred-zone
 
 therm_throt-y   += ../../i386/kernel/cpu/mcheck/therm_throt.o
 bootflag-y += ../../i386/kernel/bootflag.o
Index: linux/arch/x86_64/kernel/vsyscall.c
===
--- linux.orig/arch/x86_64/kernel/vsyscall.c2007-01-13 22:21:01.0 
-0800
+++ linux/arch/x86_64/kernel/vsyscall.c 2007-01-13 23:41:08.0 -0800
@@ -40,161 +40,12 @@
 #include <asm/segment.h>
 #include <asm/desc.h>
 #include <asm/topology.h>
-
-#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr)))
-#define __syscall_clobber "r11","rcx","memory"
-
-int __sysctl_vsyscall __section_sysctl_vsyscall = 1;
-seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED;
-
-/* is this necessary? */
-#ifndef CONFIG_NODES_SHIFT
-#define CONFIG_NODES_SHIFT 0
-#endif
-
 #include <asm/unistd.h>
 
-static __always_inline void timeval_normalize(struct timeval * tv)
-{
-   time_t __sec;
-
-   __sec = tv->tv_usec / 100;
-   if (__sec) {
-   tv->tv_usec %= 100;
-   tv->tv_sec += __sec;
-   }
-}
-
-static __always_inline void do_vgettimeofday(struct timeval * tv)
-{
-   long sequence, t;
-   unsigned long sec, usec;
-
-   do {
-   sequence = read_seqbegin(&__xtime_lock);
-   
-   sec = __xtime.tv_sec;
-   usec = __xtime.tv_nsec / 1000;
-
-   if (__vxtime.mode != VXTIME_HPET) {
-   t = get_cycles_sync();
-   if (t < __vxtime.last_tsc)
-   t = __vxtime.last_tsc;
-   usec += ((t - __vxtime.last_tsc) *
-__vxtime.tsc_quot) >> 32;
-   /* See comment in x86_64 do_gettimeofday. */
-   } else {
-   usec += ((readl((void __iomem *)
-  fix_to_virt(VSYSCALL_HPET) + 0xf0) -
- __vxtime.last) * __vxtime.quot) >> 32;
-   }
-   } while (read_seqretry(&__xtime_lock, sequence));
-
-   tv->tv_sec = sec + usec / 100;
-   tv->tv_usec = usec % 100;
-}
-
-/* RED-PEN may want to readd seq locking, but then the variable should be 
write-once. */
-static __always_inline void do_get_tz(struct timezone * tz)
-{
-   *tz = __sys_tz;
-}
-
-static __always_inline int gettimeofday(struct timeval *tv, struct timezone 
*tz)
-{
-   int ret;
-   asm volatile("vsysc2: syscall"
-   : "=a" (ret)
-   : "0" (__NR_gettimeofday),"D" (tv),"S" (tz) : __syscall_clobber 
);
-   return ret;
-}
-
-s

Bug#406925: airmon-ng script depends on wireless-tools

2007-01-14 Thread dean gaudet
Package: aircrack-ng
Version: 1:0.6.2-6

the aircrack-ng package should Depend on wireless-tools... since the 
airmon-ng script requires iwpriv/iwconfig.

thanks
-dean


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Re: [rdiff-backup-users] OverflowError reading get_sigs() response in very large tree

2007-01-14 Thread dean gaudet
i wonder why i don't have this problem on million+ inode systems... i've 
had it working with python2.3 before too (although i'm using 2.4 now)... 
i'm still on 1.0.x.

is there a 32/64-bit mismatch between your hosts?  any chance there's some 
bug in that?

-dean

On Sat, 13 Jan 2007, Charles Duffy wrote:

 Running a backup with a very large number of files (about 150,000) on both
 ends, one of the RPC requests returns a response larger than PY_SSIZE_T_MAX:
 
 Traceback (most recent call last):
  File "/usr/bin/rdiff-backup", line 23, in ?
    rdiff_backup.Main.Main(sys.argv[1:])
  File "/usr/lib64/python2.3/site-packages/rdiff_backup/Main.py", line 286, in
 Main
    take_action(rps)
  File "/usr/lib64/python2.3/site-packages/rdiff_backup/Main.py", line 256, in
 take_action
    elif action == "backup": Backup(rps[0], rps[1])
  File "/usr/lib64/python2.3/site-packages/rdiff_backup/Main.py", line 306, in
 Backup
    backup.Mirror_and_increment(rpin, rpout, incdir)
  File "/usr/lib64/python2.3/site-packages/rdiff_backup/backup.py", line 49, in
 Mirror_and_increment
    dest_sigiter = DestS.get_sigs(dest_rpath)
  File "/usr/lib64/python2.3/site-packages/rdiff_backup/connection.py", line
 445, in __call__
    return apply(self.connection.reval, (self.name,) + args)
  File "/usr/lib64/python2.3/site-packages/rdiff_backup/connection.py", line
 365, in reval
    result = self.get_response(req_num)
  File "/usr/lib64/python2.3/site-packages/rdiff_backup/connection.py", line
 314, in get_response
    try: req_num, object = self._get()
  File "/usr/lib64/python2.3/site-packages/rdiff_backup/connection.py", line
 239, in _get
    data = self._read(length)
  File "/usr/lib64/python2.3/site-packages/rdiff_backup/connection.py", line
 208, in _read
    return self.inpipe.read(length)
 OverflowError: requested number of bytes is more than a Python string can hold
 
 This is, needless to say, a Bad Thing. There are certainly a few fixes
 possible (such as chunking the signatures -- perhaps by directory)... but if
 anyone can think of something better (or is familiar with upstream Python
 changes which might impact this issue, or otherwise has something to
 add/suggest), I'm very much all ears.
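a workaround in the spirit of the chunking suggestion above is to read the oversized response in bounded pieces rather than one giant read().  this is only an illustrative sketch -- the helper name, chunk size, and use of bytes are assumptions, not rdiff-backup's actual code:

```python
import io

CHUNK = 64 * 1024 * 1024  # upper bound on any single read() call (64 MiB)

def read_exact(pipe, length):
    """Read exactly `length` bytes from a file-like object, issuing
    reads no larger than CHUNK so that no single read() can exceed the
    platform's maximum string size."""
    parts = []
    remaining = length
    while remaining > 0:
        chunk = pipe.read(min(remaining, CHUNK))
        if not chunk:
            raise EOFError("pipe closed with %d bytes still expected" % remaining)
        parts.append(chunk)
        remaining -= len(chunk)
    return b"".join(parts)
```

the same idea could presumably be applied inside connection._read(), where the traceback above originates.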
 
 
 ___
 rdiff-backup-users mailing list at rdiff-backup-users@nongnu.org
 http://lists.nongnu.org/mailman/listinfo/rdiff-backup-users
 Wiki URL: http://rdiff-backup.solutionsfirst.com.au/index.php/RdiffBackupWiki
 


___
rdiff-backup-users mailing list at rdiff-backup-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/rdiff-backup-users
Wiki URL: http://rdiff-backup.solutionsfirst.com.au/index.php/RdiffBackupWiki


[patch] faster vgetcpu using sidt (take 2)

2007-01-13 Thread dean gaudet
ok here is the latest rev of this patch (against 2.6.20-rc4).

timings in cycles:

               baseline   patched    baseline   patched
               no cache   no cache   cache      cache
k8 pre-revF    21         16         14         17
k8 revF        31         17         14         17
core2          38         16         12         14
p4             49         41         24         24

the degradation in cached timings appears to be due to the 16 byte stack
frame set up for the sidt instruction.  apparently due to -mno-red-zone...
would you accept a patch which re-enables the red-zone for vsyscalls?

here is the slightly updated description:

below is a patch which improves vgetcpu latency on all x86_64 
implementations i've tested.

Nathan Laredo pointed out the sgdt/sidt/sldt instructions are 
userland-accessible and we could use their limit fields to tuck away a few 
bits of per-cpu information.

vgetcpu generally uses lsl at present, but all of sgdt/sidt/sldt are
faster than lsl on all x86_64 processors i've tested.  lsl requires
microcoded permission testing whereas s*dt are free of any such hassle.

sldt is the least expensive of the three instructions however it's a 
hassle to use because processes may want to adjust their ldt.  sidt/sgdt 
have essentially the same performance across all the major architectures 
-- however sidt has the advantage that its limit field is 16-bits, yet any 
value >= 0xfff is essentially "infinite" because there are only 256 (16 
byte) descriptors.  so sidt is probably the best choice of the three.

in benchmarking i've discovered the rdtscp implementation of vgetcpu is 
slower than even the lsl-based implementation on opteron revF.  so i've 
dropped the rdtscp implementation in this patch.  however i've left the 
rdtscp_aux register initialized because i'm sure it's the right choice for 
various proposed vgettimeofday / per-cpu tsc state improvements which need 
the atomic nature of the rdtscp instruction and i hope it'll be used in 
those situations.

at compile time this patch detects if 0x1000 + 
(CONFIG_NR_CPUS<<CONFIG_NODES_SHIFT) will fit in the idt limit field and 
selects the lsl method otherwise.  i've further added a test for the 20 
bit limit of the lsl method and #error in the event it doesn't fit (we 
could fall all the way back to cpuid method if someone has a box with that 
many cpus*nodes, but i'll let someone else handle that case ;).

given this is a compile-time choice, and rdtscp is always slower than 
sidt, i've dropped the vgetcpu_mode variable.

timing tools and test case can be found at 
http://arctic.org/~dean/vgetcpu/

-dean

Signed-off-by: dean gaudet <[EMAIL PROTECTED]>

Index: linux/arch/x86_64/kernel/time.c
===
--- linux.orig/arch/x86_64/kernel/time.c2007-01-13 22:20:46.0 
-0800
+++ linux/arch/x86_64/kernel/time.c 2007-01-13 22:21:01.0 -0800
@@ -957,11 +957,6 @@
if (unsynchronized_tsc())
notsc = 1;
 
-   if (cpu_has(&boot_cpu_data, X86_FEATURE_RDTSCP))
-   vgetcpu_mode = VGETCPU_RDTSCP;
-   else
-   vgetcpu_mode = VGETCPU_LSL;
-
if (vxtime.hpet_address && notsc) {
timetype = hpet_use_timer ? "HPET" : "PIT/HPET";
if (hpet_use_timer)
Index: linux/arch/x86_64/kernel/vsyscall.c
===
--- linux.orig/arch/x86_64/kernel/vsyscall.c2007-01-13 22:20:46.0 
-0800
+++ linux/arch/x86_64/kernel/vsyscall.c 2007-01-13 22:21:01.0 -0800
@@ -46,7 +46,11 @@
 
 int __sysctl_vsyscall __section_sysctl_vsyscall = 1;
 seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED;
-int __vgetcpu_mode __section_vgetcpu_mode;
+
+/* is this necessary? */
+#ifndef CONFIG_NODES_SHIFT
+#define CONFIG_NODES_SHIFT 0
+#endif
 
 #include <asm/unistd.h>
 
@@ -147,11 +151,11 @@
 long __vsyscall(2)
 vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
 {
-   unsigned int dummy, p;
+   unsigned int p;
unsigned long j = 0;
 
/* Fast cache - only recompute value once per jiffies and avoid
-  relatively costly rdtscp/cpuid otherwise.
+  relatively costly lsl/sidt otherwise.
   This works because the scheduler usually keeps the process
   on the same CPU and this syscall doesn't guarantee its
   results anyways.
@@ -160,21 +164,30 @@
   If you don't like it pass NULL. */
if (tcache && tcache->blob[0] == (j = __jiffies)) {
p = tcache->blob[1];
-   } else if (__vgetcpu_mode == VGETCPU_RDTSCP) {
-   /* Load per CPU data from RDTSCP */
-   rdtscp(dummy, dummy, p);
-   } else {
+   }
+   else {
+#ifdef VGETCPU_USE_SIDT
+struct {
+char pad[6];   /* avoid unaligned stores */
+u16 size;
+u64 address;
+} idt;
+
+asm("sidt %0" : "=m" (idt.size));
+p = idt.size - 0x1000;
+#else
/* Load per CPU data from GDT */
asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
-   }
-   if (tcache) {
-   tcache->blob[0] = j;
-   tcache->blob[1] = p;
+#endif

Re: raid5 software vs hardware: parity calculations?

2007-01-13 Thread dean gaudet
On Sat, 13 Jan 2007, Robin Bowes wrote:

 Bill Davidsen wrote:
 
  There have been several recent threads on the list regarding software
  RAID-5 performance. The reference might be updated to reflect the poor
  write performance of RAID-5 until/unless significant tuning is done.
  Read that as tuning obscure parameters and throwing a lot of memory into
  stripe cache. The reasons for hardware RAID should include performance
  of RAID-5 writes is usually much better than software RAID-5 with
  default tuning.
 
 Could you point me at a source of documentation describing how to
 perform such tuning?
 
 Specifically, I have 8x500GB WD STAT drives on a Supermicro PCI-X 8-port
 SATA card configured as a single RAID6 array (~3TB available space)

linux sw raid6 small write performance is bad because it reads the entire 
stripe, merges the small write, and writes back the changed disks.  
unlike raid5 where a small write can get away with a partial stripe read 
(i.e. the smallest raid5 write will read the target disk, read the parity, 
write the target, and write the updated parity)... afaik this optimization 
hasn't been implemented in raid6 yet.

depending on your use model you might want to go with raid5+spare.  
benchmark if you're not sure.

for raid5/6 i always recommend experimenting with moving your fs journal 
to a raid1 device instead (on separate spindles -- such as your root 
disks).

if this is for a database or fs requiring lots of small writes then 
raid5/6 are generally a mistake... raid10 is the only way to get 
performance.  (hw raid5/6 with nvram support can help a bit in this area, 
but you just can't beat raid10 if you need lots of writes/s.)

beyond those config choices you'll want to become friendly with /sys/block 
and all the myriad of subdirectories and options under there.

in particular:

/sys/block/*/queue/scheduler
/sys/block/*/queue/read_ahead_kb
/sys/block/*/queue/nr_requests
/sys/block/mdX/md/stripe_cache_size

for * = any of the component disks or the mdX itself...

some systems have an /etc/sysfs.conf you can place these settings in to 
have them take effect on reboot.  (sysfsutils package on debuntu)
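for example, an /etc/sysfs.conf fragment might look like this -- the device names and values are purely illustrative starting points, not recommendations; benchmark your own workload before adopting any of them:

```
# illustrative values only -- benchmark before adopting
block/sda/queue/scheduler = deadline
block/sda/queue/nr_requests = 256
block/md0/queue/read_ahead_kb = 4096
block/md0/md/stripe_cache_size = 8192
```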

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch] add IP TOS support

2007-01-13 Thread dean gaudet
i'd like to port mod_iptos to 2.x http://arctic.org/~dean/mod_iptos/... 
and i need apr_socket_opt_set support for IP_TOS.  so here's a patch.

sorry it's against 1.2.7 but that's what debian is still using.  i'm sure 
there'll be feedback and i'll rebase against whatever you want after 
feedback.

i find it odd and confusing that apr_socket_opt_set tries to pretend that 
socket options are all boolean... and for options like SO_RCVBUF there 
isn't even any attempt to get the right answer in apr_socket_opt_get.  so 
following that example i didn't even bother with apr_socket_opt_get 
support for the new APR_SO_IP_TOS.

note that if you go about reading man pages regarding IP_TOS you'll 
probably be misled into thinking that you can set one and only one bit.  
that's false -- the entire byte is generally available.  the IP TOS field 
is a source of much disagreement and overloaded behaviour on the internet 
at large -- i reference a few of the other attempts to make use of this 
byte on my mod_iptos page above.  there's a quote somewhere about how many 
hours of time have been wasted trying to come up with the one and only 
definition of this field...

however none of this should stop anyone from using it within their own 
administrative domain... (you can strip it as it leaves your network if you 
want to do even more crazy things with it than i propose in the mod_iptos 
README).

anyhow, here's my patch... feedback welcome.

thanks
-dean

Index: apr-1.2.7/include/apr_network_io.h
===
--- apr-1.2.7.orig/include/apr_network_io.h 2007-01-13 17:21:25.0 
-0800
+++ apr-1.2.7/include/apr_network_io.h  2007-01-13 17:35:27.0 -0800
@@ -99,6 +99,18 @@
 * until data is available.
 * @see apr_socket_accept_filter
 */
+#define APR_SO_IP_TOS 65536 /** IP Type-of-Service field */
+
+/** @} */
+
+/**
+ * @defgroup APR_SO_IP_TOS option convenience definitions
+ * @{
+ */
+#define APR_IPTOS_LOWDELAY  0x10/** Low Delay */
+#define APR_IPTOS_THROUGHPUT0x08/** Throughput */
+#define APR_IPTOS_RELIABILITY   0x04/** Reliability */
+#define APR_IPTOS_LOWCOST   0x02/** Lowest Cost */
 
 /** @} */
 
@@ -589,6 +601,11 @@
  *  of local addresses.
  *APR_SO_SNDBUF --  Set the SendBufferSize
  *APR_SO_RCVBUF --  Set the ReceiveBufferSize
+ *APR_SO_IP_TOS --  Set the IP Type-of-Service byte (any byte
+ *  value can be used subject to many
+ *  conflicting RFCs and proposals too numerous
+ *  to detail here, or use one of the four
+ *  convenience definitions).
  * </PRE>
  * @param on Value for the option.
  */
Index: apr-1.2.7/network_io/unix/sockopt.c
===
--- apr-1.2.7.orig/network_io/unix/sockopt.c2007-01-13 17:36:42.0 
-0800
+++ apr-1.2.7/network_io/unix/sockopt.c 2007-01-13 17:47:29.0 -0800
@@ -318,6 +318,17 @@
 return APR_ENOTIMPL;
 #endif
 break;
+case APR_SO_IP_TOS:
+#ifdef IP_TOS
+{
+int value = on;
+if (setsockopt(sock->socketdes, IPPROTO_IP, IP_TOS,
+(void*)&value, sizeof(value)) == -1) {
+return errno;
+}
+}
+#endif
+break;
 default:
 return APR_EINVAL;
 }
Index: apr-1.2.7/configure.in
===
--- apr-1.2.7.orig/configure.in 2007-01-13 17:44:31.0 -0800
+++ apr-1.2.7/configure.in  2007-01-13 17:44:58.0 -0800
@@ -1015,6 +1015,7 @@
 kernel/OS.h\
 net/errno.h\
 netinet/in.h   \
+netinet/ip.h   \
 netinet/sctp.h  \
 netinet/sctp_uio.h  \
 sys/file.h \
Index: apr-1.2.7/include/arch/unix/apr_arch_networkio.h
===
--- apr-1.2.7.orig/include/arch/unix/apr_arch_networkio.h   2007-01-13 
17:45:05.0 -0800
+++ apr-1.2.7/include/arch/unix/apr_arch_networkio.h2007-01-13 
17:45:16.0 -0800
@@ -61,6 +61,9 @@
 #if APR_HAVE_NETINET_IN_H
 #include <netinet/in.h>
 #endif
+#if APR_HAVE_NETINET_IP_H
+#include <netinet/ip.h>
+#endif
 #if APR_HAVE_ARPA_INET_H
 #include arpa/inet.h
 #endif
Index: apr-1.2.7/test/testsockopt.c
===
--- apr-1.2.7.orig/test/testsockopt.c   2007-01-13 17:36:08.0 -0800
+++ apr-1.2.7/test/testsockopt.c2007-01-13 17:56:39.0 -0800
@@ -115,6 +115,19 @@
 #endif
 }
 
+static void set_ip_tos(abts_case *tc, void *data)
+{
+apr_status_t rv;
+
+/* can't 

porting mod_iptos to 2.x

2007-01-13 Thread dean gaudet
i'd like to port mod_iptos to 2.x http://arctic.org/~dean/mod_iptos/ ... 
and preferably have it accepted into the main distribution.

in 1.3.x it was trivial to do mod_iptos as a module because all i had to 
do was setsockopt(r->connection->client->fd, IPPROTO_IP, IP_TOS, ...).  
i've already sent a patch to [EMAIL PROTECTED] to get the apr support for that 
under 
way... but for 2.x there's a bit more complexity getting the socket.

there are many extra possibilities in a 2.x port, and i'm not sure which 
way to go, so i'm looking for input first.

in the 1.3.x module i implemented two options:

IPTOS tos_spec
IPTOSthreshold num_bytes tos_spec

the first sets the default IPTOS to the specified value (and is 
configurable per directory/etc).

the second works only for static responses and would change the tos_spec 
if the response was greater than a certain size.

for example:

IPTOS none
IPTOSthreshold 200 throughput

this will segregate small and large responses and with many network 
traffic shapers will automagically give your website better response times 
when your uplink is choked.

for 2.x i suppose the right approach for implementing the threshold is via 
filter.  i assume there's some way to get the filesize when it's known -- 
and in cases where the filesize is unknown the filter could just change 
the threshold after the threshold is passed (not ideal, but doesn't matter 
for my needs).

i could see implementing this either as part of the core or as a module... 
whatever seems to fit.  (assuming a module can actually get a hold of the 
client socket somehow.)

for old timers: it's been 5 or 6 years since i've looked at apache 2.x 
source and things were in a heavy state of flux back then, so don't assume 
i know the best way to implement this :)

suggestions welcome.

-dean

p.s. i have no doubt there are traffic shaping / bandwidth limiting 
modules for apache 2.x, but those are only appropriate for a single server 
and with no other uplink contention.  the network is the right place to do 
traffic shaping.


Re: raid5 software vs hardware: parity calculations?

2007-01-12 Thread dean gaudet
On Thu, 11 Jan 2007, James Ralston wrote:

 I'm having a discussion with a coworker concerning the cost of md's
 raid5 implementation versus hardware raid5 implementations.
 
 Specifically, he states:
 
  The performance [of raid5 in hardware] is so much better with the
  write-back caching on the card and the offload of the parity, it
  seems to me that the minor increase in work of having to upgrade the
  firmware if there's a buggy one is a highly acceptable trade-off to
  the increased performance.  The md driver still commits you to
  longer run queues since IO calls to disk, parity calculator and the
  subsequent kflushd operations are non-interruptible in the CPU.  A
  RAID card with write-back cache releases the IO operation virtually
  instantaneously.
 
 It would seem that his comments have merit, as there appears to be
 work underway to move stripe operations outside of the spinlock:
 
 http://lwn.net/Articles/184102/
 
 What I'm curious about is this: for real-world situations, how much
 does this matter?  In other words, how hard do you have to push md
 raid5 before doing dedicated hardware raid5 becomes a real win?

hardware with battery backed write cache is going to beat the software at 
small write traffic latency essentially all the time but it's got nothing 
to do with the parity computation.

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: O_DIRECT question

2007-01-11 Thread dean gaudet
On Thu, 11 Jan 2007, Linus Torvalds wrote:

> On Thu, 11 Jan 2007, Viktor wrote:
> > 
> > OK, madvise() used with mmap'ed file allows to have reads from a file
> > with zero-copy between kernel/user buffers and don't pollute cache
> > memory unnecessarily. But how about writes? How is to do zero-copy
> > writes to a file and don't pollute cache memory without using O_DIRECT?
> > Do I miss the appropriate interface?
> 
> mmap()+msync() can do that too.
> 
> Also, regular user-space page-aligned data could easily just be moved into 
> the page cache. We actually have a lot of the infrastructure for it. See 
> the "splice()" system call.

it seems to me that if splice and fadvise and related things are 
sufficient for userland to take care of things "properly" then O_DIRECT 
could be changed into splice/fadvise calls either by a library or in the 
kernel directly...

looking at the splice(2) api it seems like it'll be difficult to implement 
O_DIRECT pread/pwrite from userland using splice... so there'd need to be 
some help there.

i'm probably missing something.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH - RFC] allow setting vm_dirty below 1% for large memory machines

2007-01-11 Thread dean gaudet
On Thu, 11 Jan 2007, Andrew Morton wrote:

> On Thu, 11 Jan 2007 03:04:00 -0800 (PST)
> dean gaudet <[EMAIL PROTECTED]> wrote:
> 
> > On Tue, 9 Jan 2007, Neil Brown wrote:
> > 
> > > Imagine a machine with lots of memory - say 100Gig.
> > 
> > i've had these problems on machines as "small" as 8GiB.  the real problem 
> > is that the kernel will let millions of potential (write) IO ops stack up 
> > for a device which can handle only mere 100s of IOs per second.  (and i'm 
> > not convinced it does the IOs in a sane order when it has millions to 
> > choose from)
> > 
> > replacing the percentage based dirty_ratio / dirty_background_ratio with 
> > sane kibibyte units is a good fix... but i'm not sure it's sufficient.
> > 
> > it seems like the "flow control" mechanism (i.e. dirty_ratio) should be on 
> > a device basis...
> > 
> > try running doug ledford'd memtest.sh on an 8GiB box with a single disk, 
> > let it go a few minutes then ^C and type "sync".  i've had to wait 10 
> > minutes (2.6.18 with default vm settings).
> > 
> > it makes it hard to guarantee a box can shutdown quickly -- nasty for 
> > setting up UPS on-battery timeouts for example.
> > 
> 
> Increasing the request queue size should help there
> (/sys/block/sda/queue/nr_requests).  Maybe 25% or more benefit with that
> test, at a guess.

hmm i've never had much luck with increasing nr_requests... if i get a 
chance i'll reproduce the problem and try that.


> Probably initscripts should do that rather than leaving the kernel defaults
> in place.  It's a bit tricky for the kernel to do because the decision
> depends upon the number of disks in the system, as well as the amount of
> memory.
> 
> Or perhaps the kernel should implement a system-wide limit on the number of
> requests in flight.  While avoiding per-device starvation.  Tricky.

actually a global dirty_ratio causes interference between devices which 
should otherwise not block each other...

if you set up a "dd if=/dev/zero of=/dev/sdb bs=1M" it shouldn't affect 
write performance on sda -- but it does... because the dd basically 
dirties all of the "dirty_background_ratio" pages and then any task 
writing to sda has to block in the foreground...  (i've had this happen in 
practice -- my hack fix is oflag=direct on the dd... but the problem still 
exists.)

i'm not saying fixing any of this is easy, i'm just being a user griping 
about it :)

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH - RFC] allow setting vm_dirty below 1% for large memory machines

2007-01-11 Thread dean gaudet
On Tue, 9 Jan 2007, Neil Brown wrote:

> Imagine a machine with lots of memory - say 100Gig.

i've had these problems on machines as "small" as 8GiB.  the real problem 
is that the kernel will let millions of potential (write) IO ops stack up 
for a device which can handle only mere 100s of IOs per second.  (and i'm 
not convinced it does the IOs in a sane order when it has millions to 
choose from)

replacing the percentage based dirty_ratio / dirty_background_ratio with 
sane kibibyte units is a good fix... but i'm not sure it's sufficient.

it seems like the "flow control" mechanism (i.e. dirty_ratio) should be on 
a device basis...

try running doug ledford'd memtest.sh on an 8GiB box with a single disk, 
let it go a few minutes then ^C and type "sync".  i've had to wait 10 
minutes (2.6.18 with default vm settings).

it makes it hard to guarantee a box can shutdown quickly -- nasty for 
setting up UPS on-battery timeouts for example.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] faster vgetcpu using sidt

2007-01-08 Thread dean gaudet
On Sat, 6 Jan 2007, dean gaudet wrote:

> below is a patch which improves vgetcpu latency on all x86_64 
> implementations i've tested.
> 
> Nathan Laredo pointed out the sgdt/sidt/sldt instructions are 
> userland-accessible and we could use their limit fields to tuck away a few 
> bits of per-cpu information.
...

i got a hold of a p4 (model 4) and ran the timings there:

               baseline    patched     patched
                           no cache    cache
k8 pre-revF    21          14          16
k8 revF        31          14          17
core2          38          12          17
p4             49          24          37

not as good as i hoped... i'll have to put the cache back in just for the 
p4... so i'll respin my patch with the cache back in place.

another thought occurred to me -- 64-bit processes can't actually use their 
LDT can they?  in that case i could probably use sldt (faster than sidt) 
for 64-bit procs and fallback to sidt for 32-bit emulation (which doesn't 
exist for this vsyscall yet anyhow).

let me know if you have any other feedback.

thanks
-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] All Transmeta CPUs have constant TSCs

2007-01-08 Thread dean gaudet
On Mon, 8 Jan 2007, H. Peter Anvin wrote:

> I *definitely* support the concept that RDPMC 0 should count CPU cycles by
> convention in Linux.

unfortunately that'd be very limiting and annoying on core2 processors 
which have dedicated perf counters for clocks unhalted (actual vs. 
nominal), but only 2 configurable perf counters.  i forget what ecx value 
gets you the dedicated counters... but a solution which might work would 
be a syscall to return the perf counter number...

or we could just merge perfmon ;)

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] All Transmeta CPUs have constant TSCs

2007-01-08 Thread dean gaudet
On Fri, 5 Jan 2007, Jan Engelhardt wrote:

> 
> On Jan 4 2007 17:48, H. Peter Anvin wrote:
> >
> >[i386] All Transmeta CPUs have constant TSCs
> >All Transmeta CPUs ever produced have constant-rate TSCs.
> 
> A TSC is ticking according to the CPU frequency, is not it?

transmeta decided years before intel and amd that a constant rate tsc 
(unaffected by P-state) was the only sane choice.  on transmeta cpus the 
tsc increments at the maximum cpu frequency no matter what the P-state 
(and no matter what longrun is doing behind the kernel's back).

mind you, many people thought this was a crazy choice at the time...

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Bug#315547: [patch] stop fd leak in libnss-ldap

2007-01-08 Thread dean gaudet
i believe these 3 bugs are the same problem.

when the ldap server closes the connection during a response, libnss-ldap 
doesn't really notice at all... it returns an error code to the caller but 
doesn't pay attention to the error code itself.  then nscd (or whatever 
other caller is involved) tries to re-open a new connection and 
libnss-ldap decides that someone has stolen its old socket and proceeds to 
leak the socket.

the do_get_our_socket portion of the patch below fixes the leak.

the do_drop_connection portion of this patch which is not technically 
required to fix the leak -- it fixes another bug:  libnss-ldap is totally 
broken in multithreaded programs (such as nscd) because you can't do 
"close(10); dup2(14,10);" and guarantee another thread didn't re-open fd 
10 in the meanwhile.  the patch as included fixes this problem but only 
when non-ssl connections are in use... in the case ssl connections are in 
use it's just totally broken and can't be fixed.  yay.  (however thanks to 
fixing the do_get_our_socket code the drop code is rarely called in the 
dangerous manner.)

-dean

Index: libnss-ldap-250/ldap-nss.c
===
--- libnss-ldap-250.orig/ldap-nss.c 2006-04-26 18:19:00.0 -0700
+++ libnss-ldap-250/ldap-nss.c  2007-01-08 21:40:41.0 -0800
@@ -793,23 +793,31 @@ do_get_our_socket(int *sd)
   NSS_LDAP_SOCKLEN_T peernamelen = sizeof (peername);
 
   if (getsockname (*sd, (struct sockaddr *) &sockname, &socknamelen) != 0 ||
-  getpeername (*sd, (struct sockaddr *) &peername, &peernamelen) != 0)
+ !do_sockaddr_isequal (__session.ls_sockname,
+   socknamelen,
+   sockname,
+   socknamelen))
+{
+  isOurSocket = 0;
+}
+  /*
+   * XXX: We don't pay any attention to return codes in places such as
+   * do_search_s so we never observe when the other end has disconnected
+   * our socket.  In that case we'll get an ENOTCONN error here... and
+   * it's best we ignore the error -- otherwise we'll leak a filedescriptor.
+   * The correct fix would be to test error codes in many places.
+   */
+  else if (getpeername (*sd, (struct sockaddr *) &peername, &peernamelen) != 0)
{
- isOurSocket = 0;
+  if (errno != ENOTCONN)
+isOurSocket = 0;
}
   else
{
- isOurSocket = do_sockaddr_isequal (__session.ls_sockname,
-socknamelen,
-sockname,
-socknamelen);
- if (isOurSocket)
-   {
- isOurSocket = do_sockaddr_isequal (__session.ls_peername,
-peernamelen,
-peername,
-peernamelen);
-   }
+  isOurSocket = do_sockaddr_isequal (__session.ls_peername,
+  peernamelen,
+  peername,
+  peernamelen);
}
 }
 #endif /* HAVE_LDAPSSL_CLIENT_INIT */
@@ -876,13 +884,16 @@ do_drop_connection(int sd, int closeSd)
 dummyfd = socket (AF_INET, SOCK_STREAM, 0);
 if (dummyfd > -1 && dummyfd != sd)
   {
-   do_closefd (sd);
+/* we must let dup2 close sd for us to avoid race conditions
+ * in multithreaded code.
+ */
do_dupfd (dummyfd, sd);
do_closefd (dummyfd);
   }
 
 #ifdef HAVE_LDAP_LD_FREE
 #if defined(LDAP_API_FEATURE_X_OPENLDAP) && (LDAP_API_VERSION > 2000)
+/* XXX: when using openssl this will *ALWAYS* close the fd */
 (void) ldap_ld_free (__session.ls_conn, 0, NULL, NULL);
 #else
 (void) ldap_ld_free (__session.ls_conn, 0);
@@ -892,13 +903,18 @@ do_drop_connection(int sd, int closeSd)
 #endif /* HAVE_LDAP_LD_FREE */
 
 /* Do we want our original sd back? */
-do_closefd (sd);
 if (savedfd > -1)
   {
if (closeSd == 0)
  do_dupfd (savedfd, sd);
+else
+  do_closefd (sd);
do_closefd (savedfd);
-}
+  }
+else
+  {
+do_closefd (sd);
+  }
   }
 #else /* No sd available */
   {


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

2007-01-07 Thread dean gaudet
On Wed, 3 Jan 2007, Andrew Morton wrote:

> On Wed, 03 Jan 2007 22:56:07 -0800 (PST)
> David Miller <[EMAIL PROTECTED]> wrote:
> 
> > Note that the original rtorrent debian bug report was against 2.6.18
> 
> I think that was 2.6.18+debian-added-dirty-page-tracking-patches.

i've seen it on a 2.6.18.4 box (no other patches) which was running 
rtorrent at 50mbps rates for a few days... i was only using the box to 
stress a new circuit before putting live traffic on it, so i didn't spend 
any time debugging the failures.

at least i'm assuming this is the same problem you've all just solved: 
rtorrent reports a sha1 miscompare after the download is finished when it 
rechecks the entire file before it enters seeding mode?

in this setup the rtorrent was uploading to ~3k clients simultaneously at 
50mbps... and downloading new torrents at a rate of ~2mbps (very gross 
average -- the download is a lot more bursty than that in practice).

the box has 2GiB of ram, and there were 5.3GB of torrents "live" at any 
time.  rtorrent is pretty bad at dealing with high bitrates for working 
sets larger than ram -- but the 10krpm scsi disk was only ~25% busy 
(unlike other setups i've tried at these bitrates where the disks become 
the bottleneck unless i back off on the torrent working set size).

btw if anyone has a fast pipe and wants to retest the conditions it should 
be easy to reproduce... just join the electricsheep bt swarm and you'll 
almost instantly fill your uplink.  the clients are very hungry ;)  none 
of this is pirated material -- it's basically a distributed render farm 
sharing animations via bt.  let me know if you want help setting such a 
test up.

i attached the .config in case there's anything of interest in it.

-dean

config-2.6.18.4.bz2
Description: Binary data


[patch] faster vgetcpu using sidt

2007-01-06 Thread dean gaudet
below is a patch which improves vgetcpu latency on all x86_64 
implementations i've tested.

Nathan Laredo pointed out the sgdt/sidt/sldt instructions are 
userland-accessible and we could use their limit fields to tuck away a few 
bits of per-cpu information.

vgetcpu generally uses lsl at present, but all of sgdt/sidt/sldt are 
faster than lsl on all x86_64 processors i've tested.  on p4 processors 
lsl tends to be 150 cycles whereas the s*dt instructions are 15 cycles or 
less.  lsl requires microcoded permission testing whereas s*dt are free 
of any such hassle.

sldt is the least expensive of the three instructions however it's a 
hassle to use because processes may want to adjust their ldt.  sidt/sgdt 
have essentially the same performance across all the major architectures 
-- however sidt has the advantage that its limit field is 16-bits, yet any 
value >= 0xfff is essentially "infinite" because there are only 256 (16 
byte) descriptors.  so sidt is probably the best choice of the three.

in benchmarking i've discovered the rdtscp implementation of vgetcpu is 
slower than even the lsl-based implementation on opteron revF.  so i've 
dropped the rdtscp implementation in this patch.  however i've left the 
rdtscp_aux register initialized because i'm sure it's the right choice for 
various proposed vgettimeofday / per-cpu tsc state improvements which need 
the atomic nature of the rdtscp instruction and i hope it'll be used in 
those situations.

at compile time this patch detects if 0x1000 + 
(CONFIG_NR_CPUS << CONFIG_NODES_SHIFT) will fit in the idt limit field and 
selects the lsl method otherwise.  i've further added a test for the 20 
bit limit of the lsl method and #error in the event it doesn't fit (we 
could fall all the way back to cpuid method if someone has a box with that 
many cpus*nodes, but i'll let someone else handle that case ;).

given this is a compile-time choice, and rdtscp is always slower than 
sidt, i've dropped the vgetcpu_mode variable.

i've also dropped the cache support in the sidt case -- depending on the 
compiler and cpu i found it to be 1 cycle slower than the uncached case, 
and it just doesn't seem worth the potential extra L1 traffic (besides if 
you add in the implied __thread overhead it's definitely a loss).

here are the before/after results:

             baseline   patched
                        no cache   cache
k8 pre-revF     21         14        16
k8 revF         31         14        17
core2           38         12        17

sorry i don't have a handy EMT p4 on which i can install a 2.6.20-rc3 
kernel...  but based on userland-only comparisons of the sidt/lsl 
instructions i'll be amazed if this isn't a huge win on p4.

timing tools and test case can be found at 
http://arctic.org/~dean/vgetcpu/

-dean

Signed-off-by: dean gaudet <[EMAIL PROTECTED]>

Index: linux/arch/x86_64/kernel/time.c
===
--- linux.orig/arch/x86_64/kernel/time.c2007-01-06 13:31:10.0 
-0800
+++ linux/arch/x86_64/kernel/time.c 2007-01-06 16:04:01.0 -0800
@@ -957,11 +957,6 @@
if (unsynchronized_tsc())
notsc = 1;
 
-   if (cpu_has(&boot_cpu_data, X86_FEATURE_RDTSCP))
-   vgetcpu_mode = VGETCPU_RDTSCP;
-   else
-   vgetcpu_mode = VGETCPU_LSL;
-
if (vxtime.hpet_address && notsc) {
timetype = hpet_use_timer ? "HPET" : "PIT/HPET";
if (hpet_use_timer)
Index: linux/arch/x86_64/kernel/vsyscall.c
===
--- linux.orig/arch/x86_64/kernel/vsyscall.c2007-01-06 13:31:10.0 
-0800
+++ linux/arch/x86_64/kernel/vsyscall.c 2007-01-06 17:29:36.0 -0800
@@ -40,13 +40,18 @@
 #include <asm/segment.h>
 #include <asm/desc.h>
 #include <asm/topology.h>
+#include <asm/desc.h>
 
 #define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr)))
 #define __syscall_clobber "r11","rcx","memory"
 
 int __sysctl_vsyscall __section_sysctl_vsyscall = 1;
 seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED;
-int __vgetcpu_mode __section_vgetcpu_mode;
+
+/* is this necessary? */
+#ifndef CONFIG_NODES_SHIFT
+#define CONFIG_NODES_SHIFT 0
+#endif
 
 #include <asm/unistd.h>
 
@@ -147,11 +152,21 @@
 long __vsyscall(2)
 vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
 {
-   unsigned int dummy, p;
+   unsigned int p;
+#ifdef VGETCPU_USE_SIDT
+   struct {
+   char pad[6];/* avoid unaligned stores */
+   u16 size;
+   u64 address;
+   } idt;
+
+   asm("sidt %0" : "=m" (idt.size));
+   p = idt.size - 0x1000;
+#else
unsigned long j = 0;
 
/* Fast cache - only recompute value once per jiffies and avoid
-  relatively costly rdtscp/cpuid otherwise.
+  relatively costly lsl otherwise.
   This works because the scheduler usually keeps the process
   on the same CPU and this syscall doesn't guarantee its
   results anyways.
@@ -160,21 +175,20 @@
   If you don't like it pass NULL. */
if (tcache && tcache->blob[0] == (j = __jiffies)) {
p = tcache->blob[1];
-   } else if (__vgetcpu_mode == VGETCPU_RDTSCP) {
-   /* Load per CPU data from RDTSCP */
-   rdtscp(dummy, dummy, p);
-   } else {
+   }
+   else {
/* Load per CPU data from GDT */
asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
+   if (tcache) {
+   tcache->blob[0] = j;
+   tcache->blob[1] = p;
+   }
}
-   if (tcache) {
-   tcache->blob[0] = j;
-   tcache->blob[1] = p;
-   }
+#endif
if (cpu)
-   *cpu = p & 0xfff;
+   *cpu = p >> CONFIG_NODES_SHIFT;
if (node)
-   *node = p >> 12;
+   *node = p & ((1 << CONFIG_NODES_SHIFT) - 1);
+

Re: [openssl.org #1447] [bug] 0.9.8d: rc4 cpuid test broken on dual core cpus

2007-01-04 Thread dean gaudet
On Fri, 5 Jan 2007, Andy Polyakov wrote:

  there is a cpuid test in rc4_skey.c which tests the hyperthreading cpuid bit
  to distinguish between two implementations of rc4... unfortunately this
  fails to properly distinguish the cpus.  all dual core cpus (intel or amd)
  report HT support even if they don't use symmetric-multithreading like some
  p4 do.
 
 So HT flag is no longer HyperThreading, but something else... Will look into
 it... There is another place HTT flag is checked and it's AES...

yeah HT flag now basically means multi-threading or multi-core
package... because when amd/intel went dual core they didn't want silly
license managers to charge for every core.

hmm i don't see any OPENSSL_ia32cap_P test for AES in 0.9.8d ... maybe
i should be looking at the cvs?  i'm seeing 17.5 cycles per byte for
aes-128-cbc on core2, which is pretty good.


  it seems somewhat fortunate that core2 CPUs track the p4 behaviour
  w.r.t. these two rc4 implementations.  here are the core2 results with the
  stock code / HT test:
  
  type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
  rc4        166799.58k   180552.87k   182437.93k   183381.67k   183206.87k
  
  and with cpuid test disabled:
  
  type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
  rc4        123361.30k   128102.17k   129876.57k   128787.22k   129419.95k
  
  for the record, core2 64-bit code seriously underperforming the 32-bit
  code...  here's the 32-bit results (with cpuid test enabled):
  
  type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
  rc4        254164.64k   279901.10k   279364.38k   283617.62k   276690.26k
 
 ... The key feature in 32-bit code with cpuid test is that corresponding loop
 is not unrolled. Can you test following in *64-bit* build on Core2 hardware.
 Open rc4-x86_64.pl in text editor and make jump to .Lcloop1 at line 154
 unconditional, i.e. replace jz to jmp. make, benchmark and report back. A.

small improvement...

type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4        174197.47k   182564.34k   184536.23k   185292.63k   186258.77k

i think this hints that the problem with the unrolled code is the manual
load/store alias avoidance -- there's fancy new hardware in core2 for
dealing with this (obviously it's not fancy enough :)... and it seems
the 32-bit code pushes the alias problem onto the hardware.

oh and i tried using cmove with no luck either.

bizarre... i think i copied the 32-bit code into the 64-bit Lcloop1 case 
and it's still not performing like it does in 32-bit... maybe i screwed up 
though.

-dean
__
OpenSSL Project http://www.openssl.org
Development Mailing List   openssl-dev@openssl.org
Automated List Manager   [EMAIL PROTECTED]


Re: replace "memset(...,0,PAGE_SIZE)" calls with "clear_page()"?

2007-01-02 Thread dean gaudet
On Sat, 30 Dec 2006, Denis Vlasenko wrote:

> I was experimenting with SSE[2] clear_page() which uses
> non-temporal stores. That one requires 16 byte alignment.
> 
> BTW, it worked ~300% faster than memset. But Andi Kleen
> insists that cache eviction caused by NT stores will make it
> slower in macrobenchmark.
> 
> Apart from fairly extensive set of microbechmarks
> I tested kernel compiles (i.e. "real world load")
> and they are FASTER too, not slower, but Andi
> is fairly entrenched in his opinion ;)
> I gave up.

you know, with the kernel zeroing pages through the 1:1 phys mapping, and 
userland accessing pages through a different mapping... it seems that 
frequently virtual address bits 12..14 will differ between user and 
kernel.

on K8 this results in a virtual alias conflict which costs *70 cycles* per 
cache line.  (K8 L1 DC uses virtual bits 12..14 as part of the index.)  
this is larger than the cost for L1 miss L2 hit...

this wouldn't happen with movnt... but then we get into the handwaving 
arguments about timing of accesses to the freshly zeroed page.  too bad 
there's no "evict from L1 to L2" operation -- that would avoid the virtual 
alias problem.

there's an event (75h unit mask 02h) to measure virtual alias conflicts... 
i've always wondered if there are workloads which trigger this behaviour. 
it can happen on copy to/from user as well.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Bug#386357: please use -DUNALIGNED_OK on amd64

2007-01-01 Thread dean gaudet
On Mon, 1 Jan 2007, Mark Brown wrote:

 On Wed, Sep 06, 2006 at 05:51:39PM -0700, dean gaudet wrote:
 
  note that this define wasn't necessary on 32-bit x86 because there's 
  custom 32-bit assembly which uses unaligneds even more aggressively than 
  the C code does even when given UNALIGNED_OK.
 
 Which custom 32 bit assembly are you referring to here?

my apologies... i usually research my bug reports better.

these files have assembly:

./build-tree/zlib-1.2.3/contrib/asm586/match.S
./build-tree/zlib-1.2.3/contrib/asm686/match.S
./build-tree/zlib-1.2.3/contrib/inflate86/inffast.S

but it doesn't appear that they're actually being used.

and i can't even reproduce my results... here's the averages of the user
cpu seconds for 10 runs of minizip -9o a.zip linux-2.6.19.tar:

            baseline   -DUNALIGNED_OK
k8 revF      26.62        26.59
core2        28.43        28.44

the differences are measurement noise... huh.

and similarly for miniunz:

            baseline   -DUNALIGNED_OK
k8 revF       1.29         1.30
core2         1.47         1.49

i wonder what i did differently the day i filed that report... i know
i saw an improvement that time :)

sorry for wasting your time... go ahead and close this out (unless you 
want to use it as a reminder to see if the 32-bit assembly helps...)

-dean


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#386357: please use -DUNALIGNED_OK on amd64

2007-01-01 Thread dean gaudet
On Mon, 1 Jan 2007, dean gaudet wrote:

 and i can't even reproduce my results... here's the averages of the user
 cpu seconds for 10 runs of minizip -9o a.zip linux-2.6.19.tar:
 
             baseline   -DUNALIGNED_OK
 k8 revF      26.62        26.59
 core2        28.43        28.44

you know, gzip -9 is noticeably faster than zlib... and does show
the UNALIGNED_OK benefits i was claiming in the initial bug report.
aren't they roughly the same algorithms?  i wonder what optimizations
are in gzip which aren't in zlib.

          baseline   -DUNALIGNED_OK
k8 revF   24.48      24.08
core2     26.42      24.41
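
for context, the speedup UNALIGNED_OK is after comes from letting
deflate's longest_match compare two bytes of the candidate match at a
time instead of one.  here's a simplified sketch of that idea -- not the
actual zlib code, just the shape of the trick:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the two-bytes-at-a-time match scan UNALIGNED_OK enables in
 * deflate's longest_match -- simplified, not the actual zlib code. */
static int match_len(const uint8_t *scan, const uint8_t *match, int max)
{
    int len = 0;
    uint16_t a, b;

    while (len + 2 <= max) {
        /* memcpy keeps the unaligned loads legal in portable C; with
         * UNALIGNED_OK zlib issues the wide loads directly */
        memcpy(&a, scan + len, 2);
        memcpy(&b, match + len, 2);
        if (a != b)
            break;
        len += 2;
    }
    /* finish byte-by-byte for an odd tail or a mismatch inside a pair */
    while (len < max && scan[len] == match[len])
        len++;
    return len;
}
```

whether the wide loads actually win depends on the cpu's unaligned-load
penalty, which is presumably why the numbers above are such a wash.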

-dean


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



TCP_DEFER_ACCEPT brokenness?

2006-12-30 Thread dean gaudet
hi... i'm having troubles matching up the tcp(7) man page description of 
TCP_DEFER_ACCEPT versus some comments in the kernel (2.6.20-rc2) versus 
how the kernel actually acts.

the man page says this:

       TCP_DEFER_ACCEPT
              Allows a listener to be awakened only when data arrives on
              the socket.  Takes an integer value (seconds), this can bound
              the maximum number of attempts TCP will make to complete the
              connection.  This option should not be used in code intended
              to be portable.

which is a bit confusing because it talks both about seconds and
attempts.  (and doesn't mention what happens when the timeout finishes
-- i could see dropping the socket or passing it to userland anyhow as
possibilities... but in fact the socket is dropped).

the setsockopt code in tcp.c does this:

case TCP_DEFER_ACCEPT:
        icsk->icsk_accept_queue.rskq_defer_accept = 0;
        if (val > 0) {
                /* Translate value in seconds to number of
                 * retransmits */
                while (icsk->icsk_accept_queue.rskq_defer_accept < 32 &&
                       val > ((TCP_TIMEOUT_INIT / HZ) <<
                              icsk->icsk_accept_queue.rskq_defer_accept))
                        icsk->icsk_accept_queue.rskq_defer_accept++;
                icsk->icsk_accept_queue.rskq_defer_accept++;
        }
        break;

so at least the comment agrees with the man page -- however the code
doesn't... the code finds the least n such that val <= (3<<n)...  but these
are timeouts and they're cumulative -- it would be more appropriate to
search for the least n such that

val <= (3<<0) + (3<<1) + (3<<2) + ... + (3<<n)
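
to make the difference between the two interpretations concrete, here's a
small sketch (function names are mine; TCP_TIMEOUT_INIT/HZ is assumed to
be 3 seconds, as on a stock kernel):

```c
#include <assert.h>

#define TIMEOUT_INIT 3  /* TCP_TIMEOUT_INIT / HZ on a stock kernel, assumed */

/* what the setsockopt code actually does: find the least n with
 * val <= (3 << n), plus the trailing increment */
static int defer_current(int val)
{
    int n = 0;
    while (n < 32 && val > (TIMEOUT_INIT << n))
        n++;
    return n + 1;
}

/* what the comment and man page imply: retransmit timeouts are cumulative,
 * so stop once 3 + 6 + 12 + ... reaches val */
static int defer_cumulative(int val)
{
    int n = 0, total = 0;
    while (n < 32 && total + (TIMEOUT_INIT << n) < val) {
        total += TIMEOUT_INIT << n;
        n++;
    }
    return n + 1;
}
```

e.g. for val = 30 the current code arrives at a higher retransmit count
than the cumulative reading would, since 3+6+12+24 already exceeds 30.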

but that's not all that's wrong... i'm not sure why, for val == 1 it
computes n=0 correctly (verified with getsockopt) but then it defers
way more timeouts than 2.  here's a tcpdump example where the timeout
was set to 1:

1167532741.446027 IP 127.0.0.1.56733 > 127.0.0.1.53846: S 
1792609127:1792609127(0) win 32792 <mss 16396,sackOK,timestamp 249615 
0,nop,wscale 5>
1167532741.446899 IP 127.0.0.1.53846 > 127.0.0.1.56733: S 
1785169552:1785169552(0) ack 1792609128 win 32768 <mss 16396,sackOK,timestamp 
249616 249615,nop,wscale 5>
1167532741.446122 IP 127.0.0.1.56733 > 127.0.0.1.53846: . ack 1 win 1025 
<nop,nop,timestamp 249616 249616>
1167532745.249902 IP 127.0.0.1.53846 > 127.0.0.1.56733: S 
1785169552:1785169552(0) ack 1792609128 win 32768 <mss 16396,sackOK,timestamp 
250566 249616,nop,wscale 5>
1167532745.249912 IP 127.0.0.1.56733 > 127.0.0.1.53846: . ack 1 win 1025 
<nop,nop,timestamp 250566 250566,nop,nop,sack 1 {0:1}>
1167532751.648046 IP 127.0.0.1.53846 > 127.0.0.1.56733: S 
1785169552:1785169552(0) ack 1792609128 win 32768 <mss 16396,sackOK,timestamp 
252166 250566,nop,wscale 5>
1167532751.648058 IP 127.0.0.1.56733 > 127.0.0.1.53846: . ack 1 win 1025 
<nop,nop,timestamp 252166 252166,nop,nop,sack 1 {0:1}>
1167532764.448456 IP 127.0.0.1.53846 > 127.0.0.1.56733: S 
1785169552:1785169552(0) ack 1792609128 win 32768 <mss 16396,sackOK,timestamp 
255366 252166,nop,wscale 5>
1167532764.448473 IP 127.0.0.1.56733 > 127.0.0.1.53846: . ack 1 win 1025 
<nop,nop,timestamp 255366 255366,nop,nop,sack 1 {0:1}>
1167532788.452409 IP 127.0.0.1.53846 > 127.0.0.1.56733: S 
1785169552:1785169552(0) ack 1792609128 win 32768 <mss 16396,sackOK,timestamp 
261366 255366,nop,wscale 5>
1167532788.452430 IP 127.0.0.1.56733 > 127.0.0.1.53846: . ack 1 win 1025 
<nop,nop,timestamp 261366 261366,nop,nop,sack 1 {0:1}>
1167532836.453520 IP 127.0.0.1.53846 > 127.0.0.1.56733: S 
1785169552:1785169552(0) ack 1792609128 win 32768 <mss 16396,sackOK,timestamp 
273366 261366,nop,wscale 5>
1167532836.453539 IP 127.0.0.1.56733 > 127.0.0.1.53846: . ack 1 win 1025 
<nop,nop,timestamp 273366 273366,nop,nop,sack 1 {0:1}>


now honestly i don't mind if 1s works correctly (because
apache 2.2.x is broken and sets TCP_DEFER_ACCEPT to 1 ... see
http://issues.apache.org/bugzilla/show_bug.cgi?id=41270).

but even if i use more reasonable timeouts like 30s it doesn't
behave as expected based on the docs.

not sure which way this should be resolved -- or how long the code has 
been like this...  perhaps the current behaviour should just become the 
documented behaviour (whatever the current behaviour is :).

-dean
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [rdiff-backup-users] [PATCH] Preserve symlink permissions

2006-12-29 Thread dean gaudet

On Fri, 29 Dec 2006, Andrew Ferguson wrote:

 dean gaudet wrote:
  btw -- adding two syscalls per symlink creation is a bit of a waste for 
  platforms where it doesn't matter.  any chance you'd consider adding a 
  test to fs_abilities and conditionalizing on it?
 
 Dean,
 
 I have added this test and a new symlink_perms global to avoid the
 unnecessary syscalls on platforms where it doesn't matter. The attached
 patch was applied to CVS this morning.

looks good... thanks!

-dean


___
rdiff-backup-users mailing list at rdiff-backup-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/rdiff-backup-users
Wiki URL: http://rdiff-backup.solutionsfirst.com.au/index.php/RdiffBackupWiki


[openssl.org #1447] [bug] 0.9.8d: rc4 cpuid test broken on dual core cpus

2006-12-27 Thread dean gaudet via RT

there is a cpuid test in rc4_skey.c which tests the hyperthreading cpuid 
bit to distinguish between two implementations of rc4... unfortunately 
this fails to properly distinguish the cpus.  all dual core cpus (intel or 
amd) report HT support even if they don't use symmetric-multithreading 
like some p4 do.

on a dual-core k8 revF i see the following performance from a 0.9.8d build 
without any changes:

% ./openssl-0.9.8d speed rc4
Doing rc4 for 3s on 16 size blocks: 51091562 rc4's in 3.01s
Doing rc4 for 3s on 64 size blocks: 15937508 rc4's in 3.00s
Doing rc4 for 3s on 256 size blocks: 4190704 rc4's in 3.00s
Doing rc4 for 3s on 1024 size blocks: 1062795 rc4's in 3.00s
Doing rc4 for 3s on 8192 size blocks: 133319 rc4's in 3.01s
OpenSSL 0.9.8d 28 Sep 2006
built on: Tue Dec 26 17:40:14 PST 2006
options:bn(64,64) md2(int) rc4(ptr,int) des(idx,cisc,16,int) aes(partial) 
idea(int) blowfish(ptr2)
compiler: gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -static 
-m64 -DL_ENDIAN -DTERMIO -O3 -Wall -DMD32_REG_T=int -DMD5_ASM
available timing options: TIMES TIMEB HZ=100 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes256 bytes   1024 bytes   8192 bytes
rc4 271583.05k   34.17k   357606.74k   362767.36k   362840.28k

if i disable the cpuid test in rc4_skey.c i get these much improved
numbers:

type 16 bytes 64 bytes256 bytes   1024 bytes   8192 bytes
rc4 408832.88k   463675.26k   474736.30k   481802.21k   484870.83k

i see the same difference on dual-core k8 revE as well.


it seems somewhat fortunate that core2 CPUs track the p4 behaviour
w.r.t. these two rc4 implementations.  here are the core2 results with the
stock code / HT test:

type 16 bytes 64 bytes256 bytes   1024 bytes   8192 bytes
rc4 166799.58k   180552.87k   182437.93k   183381.67k   183206.87k

and with cpuid test disabled:

type 16 bytes 64 bytes256 bytes   1024 bytes   8192 bytes
rc4 123361.30k   128102.17k   129876.57k   128787.22k   129419.95k


i understand from the comments in rc4_skey.c that you're attempting to
distinguish between {p3, k8} and {p4}... with this updated information
it seems you want to distinguish {p3, k8} and {p4, core2}.  to do this
i'd suggest decoding the cpuid vendor, family and model values... but
this becomes unmaintainable really quickly:

if (vendor == intel && (family == 15 || (family == 6 && model >= 15))) {
        // intel p4 and core2 only (and likely follow-ons to core2)
        // XXX: need to test if core (model 14) should be here
}
else {
        // everyone else
}

it seems a more sustainable solution would be some sort of
/etc/openssl.conf and an openssl speed --generate-conf option used at
package install time to test several implementations.
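
extracting the family/model values themselves is mechanical -- a sketch of
the standard CPUID leaf-1 EAX signature decode, including the
extended-family/extended-model adjustments (the leaf values in the test
are just example signatures):

```c
#include <assert.h>
#include <stdint.h>

/* Extract display family/model from a raw CPUID leaf-1 EAX signature,
 * including the extended-family/extended-model adjustments. */
static void decode_sig(uint32_t eax, unsigned *family, unsigned *model)
{
    unsigned fam = (eax >> 8) & 0xf;
    unsigned mod = (eax >> 4) & 0xf;

    if (fam == 0xf)                    /* extended family only when base == 15 */
        fam += (eax >> 20) & 0xff;
    if (fam == 0x6 || fam >= 0xf)      /* extended model for families 6 and 15 */
        mod |= ((eax >> 16) & 0xf) << 4;
    *family = fam;
    *model = mod;
}
```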

for the record, the core2 64-bit code is seriously underperforming the
32-bit code...  here are the 32-bit results (with the cpuid test enabled):

type 16 bytes 64 bytes256 bytes   1024 bytes   8192 bytes
rc4 254164.64k   279901.10k   279364.38k   283617.62k   276690.26k

sorry, i haven't developed patches to fix this... i just wanted to record
these results somewhere for now... i'm not even sure which approach is
the best to fix this.

-dean

__
OpenSSL Project http://www.openssl.org
Development Mailing List   openssl-dev@openssl.org
Automated List Manager   [EMAIL PROTECTED]


[patch] ifb double-counts packets

2006-12-23 Thread dean gaudet
it seems that ifb counts packets twice... both at xmit time and also in 
the tasklet.  i'm not sure which one of the two to drop, but here's a 
patch for dropping the counting at xmit time.

patch against 2.6.20-rc1.

-dean

Signed-off-by: dean gaudet [EMAIL PROTECTED]

Index: linux/drivers/net/ifb.c
===
--- linux.orig/drivers/net/ifb.c2006-11-29 13:57:37.0 -0800
+++ linux/drivers/net/ifb.c 2006-12-23 02:14:31.0 -0800
@@ -154,9 +154,6 @@
int ret = 0;
	u32 from = G_TC_FROM(skb->tc_verd);
 
-	stats->tx_packets++;
-	stats->tx_bytes += skb->len;
-
	if (!from || !skb->input_dev) {
 dropped:
dev_kfree_skb(skb);
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] ifb double-counts packets

2006-12-23 Thread dean gaudet
On Sat, 23 Dec 2006, jamal wrote:

 On Sat, 2006-23-12 at 02:35 -0800, dean gaudet wrote:
  it seems that ifb counts packets twice... both at xmit time and also in 
  the tasklet.  i'm not sure which one of the two to drop, but here's a 
  patch for dropping the counting at xmit time.
 
 Good catch but not quite right. The correct way to do it is to increment
 the rx_ counters instead of tx_ right at the top of  ifb_xmit().
 
 Do you wanna resubmit your patch with these changes and hopefully tested
 for your situation?

heh yeah that makes more sense :)

-dean

Signed-off-by: dean gaudet [EMAIL PROTECTED]

Index: linux/drivers/net/ifb.c
===
--- linux.orig/drivers/net/ifb.c2006-11-29 13:57:37.0 -0800
+++ linux/drivers/net/ifb.c 2006-12-23 13:52:39.0 -0800
@@ -154,8 +154,8 @@
int ret = 0;
	u32 from = G_TC_FROM(skb->tc_verd);
 
-	stats->tx_packets++;
-	stats->tx_bytes += skb->len;
+	stats->rx_packets++;
+	stats->rx_bytes += skb->len;
 
	if (!from || !skb->input_dev) {
 dropped:
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread dean gaudet
On Mon, 18 Dec 2006, Linus Torvalds wrote:

> On Tue, 19 Dec 2006, Nick Piggin wrote:
> > 
> > We never want to drop dirty data! (ignoring the truncate case, which is
> > handled privately by truncate anyway)
> 
> Bzzt.
> 
> SURE we do.
> 
> We absolutely do want to drop dirty data in the writeout path.
> 
> How do you think dirty data ever _becomes_ clean data?
> 
> In other words, yes, we _do_ want to test-and-clear all the pgtable bits 
> _and_ the PG_dirty bit. We want to do it for:
>  - writeout
>  - truncate
>  - possibly a "drop" event (which could be a case for a journal entry that 
>becomes stale due to being replaced or something - kind of "truncate" 
>on metadata)
> 
> because both of those events _literally_ turn dirty state into clean 
> state.
> 
> In no other circumstance do we ever want to clear a dirty bit, as far as I 
> can tell. 

i admit this may not be entirely relevant, but it seems like a good place 
to bring up an old problem:  when a disk dies with lots of queued writes 
it can totally bring a system to its knees... even after the disk is 
removed.  i wrote up something about this a while ago:

http://lkml.org/lkml/2005/8/18/243

so there's another reason to "clear a dirty bit"... well, in fact -- drop 
the pages entirely.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [LARTC] Per-process QoS on Linux?

2006-12-15 Thread dean gaudet
i use a mixture of multiple IP addrs and IPTOS (see 
http://arctic.org/~dean/mod_iptos/ for an apache 1.3.x module to set IPTOS 
on a per response basis).

but for uid specifically you can also use iptables blahblah -m owner 
--uid-owner $uid -j MARK --set-mark N and then match the mark with tc.

tc filter add dev $foo protocol ip parent 1: prio X handle N fw flowid A:B
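
spelled out end to end it looks something like the following (uid, device,
rates, and mark value are all hypothetical here -- an untested sketch):

```shell
# mark every packet generated by uid 1000 (owner match only works in OUTPUT)
iptables -t mangle -A OUTPUT -m owner --uid-owner 1000 -j MARK --set-mark 6

# HTB tree: class 1:10 is the low-priority class
tc qdisc add dev eth0 root handle 1: htb default 1
tc class add dev eth0 parent 1: classid 1:1 htb rate 1mbit
tc class add dev eth0 parent 1: classid 1:10 htb rate 128kbit ceil 1mbit

# steer fw-marked traffic into the low-priority class
tc filter add dev eth0 protocol ip parent 1: prio 1 handle 6 fw flowid 1:10
```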

-dean

On Sat, 16 Dec 2006, Alan Franzoni wrote:

 Hello,
 I've tried searching for this but I don't seem to be able to find a way to
 search past archives in this list.
 
 Is there a way to get a per-process qos functionality in linux? At this very
 moment, I'm using with success a kind of 'workaround' in my server, which
 involves creating multiple virtual ethernet interfaces with different IPs
 and binding servers/daemons to different IPs.
 
 Now, I'd like to use qos on my desktop as well, so I'd like to give a low
 traffic priority to one software, and an higher one to another... is there
 any way to get that accomplished?
 
 -- 
 Alan Franzoni [EMAIL PROTECTED]
 -
 Togli .xyz dalla mia email per contattarmi.
 Remove .xyz from my address in order to contact me.
 -
 GPG Key Fingerprint (Key ID = FE068F3E):
 5C77 9DC3 BD5B 3A28 E7BC 921A 0255 42AA FE06 8F3E
 
___
LARTC mailing list
LARTC@mailman.ds9a.nl
http://mailman.ds9a.nl/cgi-bin/mailman/listinfo/lartc


Re: [stable] [PATCH 46/61] fix Intel RNG detection

2006-12-14 Thread dean gaudet
On Thu, 14 Dec 2006, Jan Beulich wrote:

> >with the patch it boots perfectly without any command-line args.
> 
> Are you getting the 'Firmware space is locked read-only' message then?

yep...

so let me ask a naive question... don't we want the firmware locked 
read-only because that protects the bios from viruses?  honestly i'm naive 
in this area of pc hardware, but i'm kind of confused why we'd want 
unlocked firmware just so we can detect a RNG.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [stable] [PATCH 46/61] fix Intel RNG detection

2006-12-13 Thread dean gaudet
On Wed, 13 Dec 2006, Chris Wright wrote:

> * dean gaudet ([EMAIL PROTECTED]) wrote:
> > just for the public record (i already communicated with Jan in private 
> > mail on this one)... i have a box which hangs hard starting at 2.6.18.2 
> > and 2.6.19 -- hangs hard during the intel hw rng tests (no sysrq 
> > response).  and the hang occurs prior to the printk so it took some 
> > digging to figure out which module was taking out the system.
> > 
> > Jan's patch gets the box past the hang... it seems like this should be in 
> > at least the next 2.6.19.x stable (and if there's going to be another 
> > 2.6.18.x stable then it should be included there as well).
> 
> Thanks for the data point.  I wonder if you get SMI and never come back.
> Do you boot with no_fwh_detect=1 or -1?

with the patch it boots perfectly without any command-line args.

without the patch it crashes after the "4" and before the "5" in this 
hacked up segment of the code:

if (!(fwh_dec_en1_val & FWH_F8_EN_MASK))
pci_write_config_byte(dev,
  fwh_dec_en1_off,
  fwh_dec_en1_val | FWH_F8_EN_MASK);
if (!(bios_cntl_val &
  (BIOS_CNTL_LOCK_ENABLE_MASK|BIOS_CNTL_WRITE_ENABLE_MASK)))
pci_write_config_byte(dev,
  bios_cntl_off,
  bios_cntl_val | 
BIOS_CNTL_WRITE_ENABLE_MASK);

printk(KERN_INFO "intel-rng: 4\n");
writeb(INTEL_FWH_RESET_CMD, mem);
printk(KERN_INFO "intel-rng: 5\n");

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [stable] [PATCH 46/61] fix Intel RNG detection

2006-12-13 Thread dean gaudet
On Wed, 29 Nov 2006, Jan Beulich wrote:

> >>> Dave Jones <[EMAIL PROTECTED]> 24.11.06 21:27 >>>
> >On Wed, Nov 22, 2006 at 08:53:08AM +0100, Jan Beulich wrote:
> > > >It does appear to work w/out the patch.  I've asked for a small bit
> > > >of diagnostics (below), perhaps you've got something you'd rather see?
> > > >I expect this to be a 24C0 LPC Bridge.
> > > 
> > > Yes, that's what I'd have asked for. If it works, I expect the device
> > > code to be different, or both manufacturer and device code to be
> > > invalid. Depending on the outcome, perhaps we'll need an override
> > > option so that this test can be partially (i.e. just the device code
> > > part) or entirely (all the FWH detection) skipped.
> > > The base problem is the vague documentation of the whole
> > > detection mechanism - a lot of this I had to read between the lines.
> >
> >The bug report I referenced came back with this from that debug patch..
> >
> >intel_rng: no version for "struct_module" found: kernel tainted.
> >intel_rng: pci vendor:device 8086:24c0 fwh_dec_en1 80 bios_cntl_val 2 mfc cb 
> >dvc 88
> >intel_rng: FWH not detected
> 
> Any chance you could have them test below patch (perhaps before I
> actually submit it)? They should see the warning message added when
> not using any options, and they should then be able to use the
> no_fwh_detect option to get the thing to work again.
> 
> I'll meanwhile ask Intel about how they suppose to follow the RNG
> detection sequence when the BIOS locks out write access to the
> FWH interface.

just for the public record (i already communicated with Jan in private 
mail on this one)... i have a box which hangs hard starting at 2.6.18.2 
and 2.6.19 -- hangs hard during the intel hw rng tests (no sysrq 
response).  and the hang occurs prior to the printk so it took some 
digging to figure out which module was taking out the system.

Jan's patch gets the box past the hang... it seems like this should be in 
at least the next 2.6.19.x stable (and if there's going to be another 
2.6.18.x stable then it should be included there as well).

there is apparently no hw rng on this box (returns all 0xff).

thanks
-dean

> 
> Jan
> 
> Index: head-2006-11-21/drivers/char/hw_random/intel-rng.c
> ===
> --- head-2006-11-21.orig/drivers/char/hw_random/intel-rng.c   2006-11-21 
> 10:36:15.0 +0100
> +++ head-2006-11-21/drivers/char/hw_random/intel-rng.c2006-11-29 
> 09:09:21.0 +0100
> @@ -143,6 +143,8 @@ static const struct pci_device_id pci_tb
>  };
>  MODULE_DEVICE_TABLE(pci, pci_tbl);
>  
> +static __initdata int no_fwh_detect;
> +module_param(no_fwh_detect, int, 0);
>  
>  static inline u8 hwstatus_get(void __iomem *mem)
>  {
> @@ -240,6 +242,11 @@ static int __init mod_init(void)
>   if (!dev)
>   goto out; /* Device not found. */
>  
> + if (no_fwh_detect < 0) {
> + pci_dev_put(dev);
> + goto fwh_done;
> + }
> +
>   /* Check for Intel 82802 */
>   if (dev->device < 0x2640) {
>   fwh_dec_en1_off = FWH_DEC_EN1_REG_OLD;
> @@ -252,6 +259,23 @@ static int __init mod_init(void)
>   pci_read_config_byte(dev, fwh_dec_en1_off, &fwh_dec_en1_val);
>   pci_read_config_byte(dev, bios_cntl_off, &bios_cntl_val);
>  
> + if ((bios_cntl_val &
> +  (BIOS_CNTL_LOCK_ENABLE_MASK|BIOS_CNTL_WRITE_ENABLE_MASK))
> + == BIOS_CNTL_LOCK_ENABLE_MASK) {
> + static __initdata /*const*/ char warning[] =
> + KERN_WARNING PFX "Firmware space is locked read-only. 
> If you can't or\n"
> + KERN_WARNING PFX "don't want to disable this in 
> firmware setup, and if\n"
> + KERN_WARNING PFX "you are certain that your system has 
> a functional\n"
> + KERN_WARNING PFX "RNG, try using the 'no_fwh_detect' 
> option.\n";
> +
> + pci_dev_put(dev);
> + if (no_fwh_detect)
> + goto fwh_done;
> + printk(warning);
> + err = -EBUSY;
> + goto out;
> + }
> +
>   mem = ioremap_nocache(INTEL_FWH_ADDR, INTEL_FWH_ADDR_LEN);
>   if (mem == NULL) {
>   pci_dev_put(dev);
> @@ -280,8 +304,7 @@ static int __init mod_init(void)
>   pci_write_config_byte(dev,
> fwh_dec_en1_off,
> fwh_dec_en1_val | FWH_F8_EN_MASK);
> - if (!(bios_cntl_val &
> -   (BIOS_CNTL_LOCK_ENABLE_MASK|BIOS_CNTL_WRITE_ENABLE_MASK)))
> + if (!(bios_cntl_val & BIOS_CNTL_WRITE_ENABLE_MASK))
>   pci_write_config_byte(dev,
> bios_cntl_off,
> bios_cntl_val | 
> BIOS_CNTL_WRITE_ENABLE_MASK);
> @@ -315,6 +338,8 @@ static int __init mod_init(void)
>   goto out;
>   }
>  
> +fwh_done:
> +
>   err = -ENOMEM;
>   mem = ioremap(INTEL_RNG_ADDR, INTEL_RNG_ADDR_LEN);
>   if (!mem)


Re: Shrinking a RAID1--superblock problems

2006-12-12 Thread dean gaudet
On Tue, 12 Dec 2006, Jonathan Terhorst wrote:

 I need to shrink a RAID1 array and am having trouble with the
 persistent superblock; namely, mdadm --grow doesn't seem to relocate
 it. If I downsize the array and then shrink the corresponding
 partitions, the array fails since the superblock (which is normally
 located near the end of the device) now lays outside of the
 partitions. Is there any easier way to deal with this than digging
 into the mdadm source, manually calculating the superblock offset and
 dd'ing it to the right spot?

i'd think it'd be easier to recreate the array using --assume-clean after 
the shrink.  for raid1 it's extra easy because you don't need to get the 
disk ordering correct.
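
(for completeness, the v0.90 superblock offset the original poster was
dreading to compute is actually simple -- assuming the classic 0.90 layout,
where the superblock is 64 KiB long, 64 KiB-aligned, and as close to the
end of the device as that alignment allows:)

```c
#include <assert.h>

/* v0.90 md superblock offset: round the device size down to a 64 KiB
 * boundary, then back off one more 64 KiB for the superblock itself.
 * Sizes are in KiB. */
static long long sb_offset_kib(long long dev_kib)
{
    return (dev_kib & ~63LL) - 64;
}
```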

in fact with raid1 you don't even need to use mdadm --grow... you could do 
something like the following (assuming you've already shrunk the 
filesystem):

mdadm --stop /dev/md0
mdadm --zero-superblock /dev/sda1
mdadm --zero-superblock /dev/sdb1
fdisk /dev/sda  ... shrink partition
fdisk /dev/sdb  ... shrink partition
mdadm --create --assume-clean --level=1 -n2 /dev/md0 /dev/sd[ab]1

heck that same technique works for raid0/4/5/6 and raid10 "near" and 
"offset" layouts as well, doesn't it?  raid10 "far" layout definitely 
needs blocks rearranged to shrink.  in these other modes you'd need to be 
careful about recreating the array with the correct ordering of disks.

the zero-superblock step is an important defense against future problems 
with "assemble every array i find"-types of initrds that are unfortunately 
becoming common (i.e. debian and ubuntu).

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rdtscp vgettimeofday

2006-12-11 Thread dean gaudet
On Mon, 11 Dec 2006, Andrea Arcangeli wrote:

> On Mon, Dec 11, 2006 at 01:17:25PM -0800, dean gaudet wrote:
> > rdtscp doesn't solve anything extra [..]
> > [..] lsl-based vgetcpu is relatively slow
> 
> Well, if you accept to run slow there's nothing to solve in the first
> place indeed.
> 
> If nothing else rdtscp should avoid the mess of restarting a
> vsyscalls, which is quite a difficult problem as it heavily depends on
> the compiler/dwarf.

rdtscp gets you 2 of the 5 values you need to compute the time.  anything 
can happen between when you do the rdtscp and do the other 3 reads:  the 
computation is (((tsc-A)*B)>>N)+C where N is a constant, and A, B, C are 
per-cpu data.

A/B/C change a few times a second (to avoid 32-bit rollover in (tsc-A)), 
every time there's a halt, and every P-state transition.

if you lose your tick in the middle of those reads any number of things 
can happen to screw the computation... including being scheduled on 
another core and mixing values from two cores.
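
to make that concrete, here's a toy sketch (made-up numbers, not real 
kernel values) of how per-cpu A/B/C pairs that each give the right answer 
on their own produce a bogus time when mixed across cores:

```python
# illustrative only: N, A, B, C are made-up numbers, not real kernel data
N = 32  # fixed shift; B encodes ns-per-tick scaled by 2^N

def to_ns(tsc, A, B, C):
    return (((tsc - A) * B) >> N) + C

# two cpus describing the same wall clock with different per-cpu epochs
cpu0 = dict(A=1000, B=2 << N, C=0)      # 2 ns per tick, epoch at tsc=1000
cpu1 = dict(A=500,  B=2 << N, C=-1000)  # same clock, different A/C pair

print(to_ns(2000, **cpu0))  # 2000 -- consistent read on cpu0
print(to_ns(2000, **cpu1))  # 2000 -- consistent read on cpu1
print(to_ns(2000, A=cpu0["A"], B=cpu1["B"], C=cpu1["C"]))  # 1000 -- mixed, wrong
```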


> > even with rdtscp you have to deal with the definite possibility of being 
> > scheduled away in the middle of the computation.  arguably you need
> > to 
> 
> Isn't rdtscp atomic? all you need is to read atomically the current
> contents of the tsc and the index to use in a per-cpu table exported
> in readonly. This table will contain a per-cpu seqlock as well. Then a
> math logic has to be built with per-cpu threads, so that those per-cpu
> tables are updated by cpufreq and at regular intervals.
> 
> If this is all wrong and it's not feasible to implement a safe and
> monothonic vgettimeofday that doesn't access the southbridge and that
> doesn't require restarting the vsyscall manually by patching rip/rsp,
> I've an hard time to see how rdtscp is useful at all. I hope somebody
> thought about those issues before adding a new instruction to a
> popular CPU ;).

oh i think there are several solutions which will work... and i also think 
rdtscp wasn't a necessary addition to the ISA :)

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: rdtscp vgettimeofday

2006-12-11 Thread dean gaudet
On Mon, 11 Dec 2006, Andrea Arcangeli wrote:

> As far as I can see, many changes happened but nobody has yet added
> the rdtscp support to x86-64. rdtscp finally solves the problem and it
> obsoletes hpet for timekeeping and it allows a fully userland
> gettimeofday running at maximum speed in userland.

rdtscp doesn't solve anything extra which can't already be solved with 
existing vgetcpu (based on lsl) and rdtsc.  which have the advantage of 
working on all x86, not just the (currently) rare revF opteron.

lsl-based vgetcpu is relatively slow (because it is a protected 
instruction with lots of microcode) -- but there are other options which 
continue to work on all x86 (see http://lkml.org/lkml/2006/11/13/401).


> Before rdtscp we could never index the rdtsc offset in a proper index
> without being in kernel with preemption disabled, so it could never
> work reliably.

even with rdtscp you have to deal with the definite possibility of being 
scheduled away in the middle of the computation.  arguably you need to 
deal with the possibility of being scheduled away *and* back again to the 
same cpu (so testing cpu# at top and bottom of a loop isn't sufficient).

suleiman proposed a per-cpu scheduling event number to deal with that... 
not sure what folks think of that idea.
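
a sketch of what such a loop might look like -- all names here are 
hypothetical, it just illustrates the cpu#/sequence double-check around 
the reads, not any actual proposed ABI:

```python
from dataclasses import dataclass

@dataclass
class PerCpu:
    seq: int  # bumped whenever A/B/C are rewritten
    A: int
    B: int
    C: int

N = 32  # fixed shift in (((tsc - A) * B) >> N) + C

def read_time(read_cpu, read_tsc, table):
    # retry until cpu# and sequence are unchanged across all the reads;
    # a migration away *and* back would be caught by a per-cpu sequence
    # bump on reschedule (the idea mentioned above)
    while True:
        cpu = read_cpu()
        e = table[cpu]
        seq, A, B, C = e.seq, e.A, e.B, e.C
        tsc = read_tsc()
        if read_cpu() == cpu and table[cpu].seq == seq:
            return (((tsc - A) * B) >> N) + C

table = [PerCpu(seq=0, A=1000, B=2 << N, C=0)]
print(read_time(lambda: 0, lambda: 1500, table))  # (1500 - 1000) * 2 = 1000
```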

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Bug#390038: this is caused by the use of /sbin/update-grub

2006-12-03 Thread dean gaudet


On Sun, 3 Dec 2006, Frans Pop wrote:

> On Sunday 03 December 2006 22:34, dean gaudet wrote:
> > the linux-image .postrm script is (through some mechanism) invoking
> > /sbin/update-grub.
> >
> > /sbin/update-grub gives a warning now:
> >
> > You shouldn't call /sbin/update-grub. Please call /usr/sbin/update-grub
> > instead!
> >
> > except that warning is sent on stdout.
>
> I guess another solution for this issue would be if the wrappers in the
> grub packages wrote these messages to stderr instead of stdout.
> IMO this would be better anyway.

that doesn't seem appropriate... update-grub isn't the only tool which is 
invoked -- arbitrary user hooks are invoked, plus dozens of other 
executables.  isn't it wrong to pass them the IPC pipe on stdin?  i 
honestly don't know what the parent/child split is all about, so maybe the 
pipe is there for a good reason and there should be an update-grub wrapper 
specific to how this postrm script wants to work.

see the strace fragments below for execve calls.  that's a lot of tools 
which could erroneously stamp on stdin and cause problems.


> > - something messed up /etc/kernel-img.conf and didn't put the /usr/sbin
> >   paths on the hooks... if someone has a rc1-installed box please take
> >   a peek in there to see if it has been fixed.
>
> New installs write the lines in kernel-img.conf without any path (i.e.
> just the command).

that sounds at odds then with the path ordering in the postrm and the 
/sbin/update-grub wrapper.  installer probably should put 
/usr/sbin/update-grub in there... (or the /sbin wrapper should go away... 
but istr there's a bug regarding why it was reinstated.)

-dean

21884 execve(/var/lib/dpkg/info/linux-image-2.6.18-1-686.postrm.real, 
[/var/lib/dpkg/info/linux-image-2..., purge], [/* 28 vars */]) = 0
21884 execve(/usr/share/debconf/frontend, [/usr/share/debconf/frontend, 
/var/lib/dpkg/info/linux-image-2..., purge], [/* 29 vars */]) = 0
21885 execve(/home/dean/local/bin/locale, [locale, charmap], [/* 29 vars 
*/]) = -1 ENOENT (No such file or directory)
21885 execve(/usr/local/bin/locale, [locale, charmap], [/* 29 vars */]) = 
-1 ENOENT (No such file or directory)
21885 execve(/usr/local/sbin/locale, [locale, charmap], [/* 29 vars */]) 
= -1 ENOENT (No such file or directory)
21885 execve(/usr/bin/locale, [locale, charmap], [/* 29 vars */] 
unfinished ...
21885 ... execve resumed )= 0
21886 execve(/bin/sh, [sh, -c, stty -a 2>/dev/null], [/* 29 vars */] 
unfinished ...
21886 ... execve resumed )= 0
21887 execve(/bin/stty, [stty, -a], [/* 29 vars */]) = 0
21888 execve(/bin/sh, [sh, -c, stty -a 2>/dev/null], [/* 29 vars */] 
unfinished ...
21888 ... execve resumed )= 0
21889 execve(/bin/stty, [stty, -a], [/* 29 vars */]) = 0
21890 execve(/var/lib/dpkg/info/linux-image-2.6.18-1-686.postrm.real, 
[/var/lib/dpkg/info/linux-image-2..., purge], [/* 30 vars */]) = 0
21891 execve(/sbin/update-grub, [/sbin/update-grub, 2.6.18-1-686, 
/boot/vmlinuz-2.6.18-1-686], [/* 32 vars */]) = 0
21892 execve(/bin/grep, [grep, -q,   */sbin/update-grub$, 
/etc/kernel-img.conf], [/* 32 vars */]) = 0
21891 execve(/usr/sbin/update-grub, [/usr/sbin/update-grub, 2.6.18-1-686, 
/boot/vmlinuz-2.6.18-1-686], [/* 32 vars */]) = 0
21894 execve(/bin/uname, [uname, -s], [/* 31 vars */] unfinished ...
21895 execve(/usr/bin/tr, [tr, [A-Z], [a-z]], [/* 31 vars */] 
unfinished ...
21894 ... execve resumed )= 0
21895 ... execve resumed )= 0
21901 execve(/bin/grep, [grep, -q, ^#], [/* 31 vars */] unfinished ...
21901 ... execve resumed )= 0
21904 execve(/bin/grep, [grep, -q, ^#], [/* 31 vars */] unfinished ...
21904 ... execve resumed )= 0
21907 execve(/bin/grep, [grep, -q, ^#], [/* 31 vars */]) = 0
21910 execve(/bin/grep, [grep, -q, ^#], [/* 31 vars */] unfinished ...
21910 ... execve resumed )= 0
21913 execve(/bin/grep, [grep, -q, ^#], [/* 31 vars */]) = 0
21916 execve(/bin/grep, [grep, -q, ^#], [/* 31 vars */] unfinished ...
21916 ... execve resumed )= 0
21919 execve(/bin/grep, [grep, -q, ^#], [/* 31 vars */] unfinished ...
21919 ... execve resumed )= 0
21922 execve(/bin/grep, [grep, -q, ^#], [/* 31 vars */]) = 0
21925 execve(/bin/grep, [grep, -q, ^#], [/* 31 vars */] unfinished ...
21925 ... execve resumed )= 0
21928 execve(/bin/grep, [grep, -q, ^#], [/* 31 vars */] unfinished ...
21928 ... execve resumed )= 0
21931 execve(/bin/grep, [grep, -q, ^#], [/* 31 vars */] unfinished ...
21931 ... execve resumed )= 0
21934 execve(/bin/grep, [grep, -q, ^#], [/* 31 vars */] unfinished ...
21934 ... execve resumed )= 0
21937 execve(/bin/grep, [grep, -q, ^#], [/* 31 vars */] unfinished ...
21937 ... execve resumed )= 0
21940 execve(/bin/grep, [grep, -q, ^#], [/* 31 vars */] unfinished ...
21940 ... execve resumed )= 0
21942 execve(/bin/readlink, [readlink, -f, /dev/md3], [/* 31 vars */]) 
= 0
21946 execve(/bin/grep

Bug#390038: this is caused by the use of /sbin/update-grub

2006-12-03 Thread dean gaudet
the linux-image .postrm script is (through some mechanism) invoking 
/sbin/update-grub.

/sbin/update-grub gives a warning now:

You shouldn't call /sbin/update-grub. Please call /usr/sbin/update-grub instead!

except that warning is sent on stdout.

stdout in the .postrm script is hooked up through some pipe for
some sort of parent/child communication and the result is that
this /sbin/update-grub warning ends up being sent to the parent,
and the parent doesn't like it:

21891 write(1, You shouldn\'t call /sbin/update-..., 81) = 81
21884 ... read resumed You shouldn\'t call /sbin/update-..., 4096) =
81
21884 write(7, 20 Unsupported command \you\ (fu..., 154) = 154

this is what eventually causes the exit 128.

the real problem is the kernel .postrm mechanisms which are spawning
a zillion children with stdout still hooked up to the IPC mechanism.
please dup stderr on top of stdout and reopen stdin from /dev/null for
children.
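
the postrm hook machinery is perl, but the redirection being asked for 
looks like this in a python sketch (run_hook is a hypothetical helper, 
not anything in the actual package):

```python
import os
import subprocess
import sys

def run_hook(argv):
    # send the child's stdout to stderr and detach its stdin, so nothing
    # the hook prints can be mistaken for traffic on the parent's fd 1
    with open(os.devnull, "rb") as devnull:
        return subprocess.run(argv, stdin=devnull,
                              stdout=sys.stderr.fileno()).returncode

rc = run_hook(["sh", "-c", "echo this lands on stderr"])
print(rc)  # 0 -- the child's chatter never touched our stdout
```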

so why is the .postrm invoking /sbin/update-grub?  in my case it's
because of the contents of my /etc/kernel-img.conf:

postinst_hook = update-grub
postrm_hook   = update-grub

despite the fact that /sbin trails /usr/sbin in my PATH there's code in
.postrm which overrides the PATH:

for my $path ('/bin', '/sbin', '/usr/bin', '/usr/sbin') {
  if (-x "$path/$script") {
    exec_script($type, "$path/$script");
    return 0;
  }
}

thus ensuring that the silly /sbin/update-grub wrapper is invoked.

the grub package suggests those hooks should refer to
/usr/sbin/update-grub -- which would stop the kernel .postrm from screwing
up (but would hide the stdin/out IPC bug).

my most recently freshly installed box (from etch beta3 installer) has
a kernel-img.conf without the full pathnames on update-grub... i'm not
sure who is responsible for messing those up.  maybe this is fixed in
rc1 installer?  dunno.

anyhow, to summarize:

- kernel .postrm script needs to be more careful with its spawned
  children so as to not screw up its IPC mechanism

- something messed up /etc/kernel-img.conf and didn't put the /usr/sbin
  paths on the hooks... if someone has a rc1-installed box please take
  a peek in there to see if it has been fixed.

-dean


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]


Re: Observations of a failing disk

2006-11-27 Thread dean gaudet
On Tue, 28 Nov 2006, Richard Scobie wrote:

> Anyway, my biggest concern is why
>
> echo repair > /sys/block/md5/md/sync_action
>
> appeared to have no effect at all, when I understand that it should re-write
> unreadable sectors?

i've had the same thing happen on a seagate 7200.8 pata 400GB... and went 
through the same sequence of operations you described, and the dd fixed 
it.

one theory was that i lucked out and the pending sectors were in the 
unused part of the disk near the md superblock... but since that's in 
general only about 90KB of disk i was kind of skeptical.  it's certainly 
possible, but seems unlikely.

another theory is that a pending sector doesn't always result in a read 
error -- i.e. depending on temperature?  but the question is, why wouldn't 
the disk try rewriting it if it does get a successful read.

i wish hard drives were a little less voodoo.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gratuitous arp

2006-11-26 Thread dean gaudet
On Sun, 26 Nov 2006, James Courtier-Dutton wrote:

> dean gaudet wrote:
> > On Sun, 26 Nov 2006, James Courtier-Dutton wrote:
> >
> > > dean gaudet wrote:
> > > > hi...
> > > >
> > > > i ran into some problems recently which would have been avoided if my box
> > > > did a gratuitous arp as it brought up all interfaces (the router took
> > > > forever to timeout the ARP entries for interface aliases).  so i set about
> > > > looking to see why that wasn't happening.
> > ...
> > > Are you 100% sure about this?
> > > Have you done a packet sniff on the network?
> > > A lot of routers ignore gratuitous arp for security reasons.
> >
> > yeah i've done some packet sniffing to verify this.
> >
> > here's what happened (twice now):  i upgraded a (normally busy) box, so the
> > MAC address changed.  the router is a cisco (not managed by me).
> >
> > debian reboot sequence at some point brings up the primary eth0 address and
> > very soon thereafter there will be an "arp who-has $default_gw tell
> > $primary_addr".  that's sufficient to get the cisco to update its ARP cache
> > for $primary_addr.  this isn't gratuitous arp, but does the trick for the
> > $primary_addr.
> >
> > but there's no gratuitous arp for any eth0:N aliased interfaces... and the
> > cisco ARP cache on this ISP router seems to be set to a long timeout.  i
> > could reach eth0:N from local net, but couldn't get outside local net from
> > eth0:N.
> >
> > issuing "arping -I eth0 -s $secondary_addr $default_gw" for each secondary
> > address updated the cisco ARP cache and i could then reach eth0:N remotely.
> >
> > so... that may not be exactly gratuitous arp, but basically i was stuck
> > until i forced the cisco to update its ARP cache for each of the secondary
> > addrs...
> >
> > it seems to me it'd be nice for the init sequence to take care of this, so
> > that other folks don't have to spend time debugging similar problems.  i
> > just wanted to ask if i'm missing something obvious before i go open a
> > debian bug.  (i'm tempted to see if fedora does anything differently.)
> >
> > thanks
> > -dean
>
> Ok, I think it is better to just do gratuitous arp on the primary interface.
> If one starts doing it on secondary interfaces, one would then have to also
> do it for all proxy-arp addresses (if used), and things could start getting
> rather messy.

the primary address (the address which is used as the source address for 
all ARP packets) didn't need a gratuitous ARP because it sent a real ARP 
request to find the default gateway's MAC addr.

it was all the rest of the addresses which were screwed (which i'll call 
secondary just because they're not the ones which are used in ARP 
requests, and aren't the ones used as default addresses for INADDR_ANY 
sockets).

but yeah, i can see an ARP storm nightmare if every address does it at the 
same time at boot... with the likely result of the cisco dropping some 
(especially because i'm sure ARP is on the slow path through the generally 
weak cpu in a cisco router).

ugh, this does seem a rather specialized problem, and manually fixing it 
with arping/garp/send_arp seems most appropriate.

i pondered a daemon which would use libpcap to observe traffic for a while 
and look at outbound packets which aren't seeing inbound responses and 
then try to help with a directed ARP... and would stop after a few 
minutes... but it's so special purpose it's just silly.  it's useful only 
for a machine upgrade in the presence of silly default 4h ARP cache 
timeouts or for IP failover without MAC failover and in the presence of 
boxes which ignore grat arp.

-dean
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


gratuitous arp

2006-11-25 Thread dean gaudet
hi...

i ran into some problems recently which would have been avoided if my box 
did a gratuitous arp as it brought up all interfaces (the router took 
forever to timeout the ARP entries for interface aliases).  so i set about 
looking to see why that wasn't happening.

i either missed it, or there's no code in the kernel to do it -- but 
that's cool, because it's easy enough to do from userland.  i'm guessing 
this is the intention.

however my debian and ubuntu boxes aren't doing grat arp and don't seem to 
have options to do it (i do know about using various other tools such as 
arping, send_arp, garp to do it manually).

before i go opening bugs with the distribution folks, could someone chime 
in as to what is the recommended approach these days?  did grat arp fall 
out of favour, or is it just a case of userland not keeping up?

thanks
-dean
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gratuitous arp

2006-11-25 Thread dean gaudet
On Sun, 26 Nov 2006, James Courtier-Dutton wrote:

> dean gaudet wrote:
> > hi...
> >
> > i ran into some problems recently which would have been avoided if my box
> > did a gratuitous arp as it brought up all interfaces (the router took
> > forever to timeout the ARP entries for interface aliases).  so i set about
> > looking to see why that wasn't happening.
> ...
>
> Are you 100% sure about this?
> Have you done a packet sniff on the network?
> A lot of routers ignore gratuitous arp for security reasons.

yeah i've done some packet sniffing to verify this.

here's what happened (twice now):  i upgraded a (normally busy) box, so 
the MAC address changed.  the router is a cisco (not managed by me).

debian reboot sequence at some point brings up the primary eth0 address 
and very soon thereafter there will be an "arp who-has $default_gw tell 
$primary_addr".  that's sufficient to get the cisco to update its ARP 
cache for $primary_addr.  this isn't gratuitous arp, but does the trick 
for the $primary_addr.

but there's no gratuitous arp for any eth0:N aliased interfaces... and the 
cisco ARP cache on this ISP router seems to be set to a long timeout.  i 
could reach eth0:N from local net, but couldn't get outside local net from 
eth0:N.

issuing "arping -I eth0 -s $secondary_addr $default_gw" for each secondary 
address updated the cisco ARP cache and i could then reach eth0:N 
remotely.
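
for anyone hitting the same thing, the manual fix boils down to a loop 
like this (gateway, interface and addresses are placeholders; the 
commands are printed rather than executed):

```python
# gateway, interface and alias addresses are placeholders
GW = "192.0.2.1"
IFACE = "eth0"
SECONDARIES = ["192.0.2.10", "192.0.2.11"]

for addr in SECONDARIES:
    # printed rather than executed; feed these to a shell to actually
    # refresh the router's ARP cache entry for each secondary address
    print(f"arping -c 3 -I {IFACE} -s {addr} {GW}")
```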

so... that may not be exactly gratuitous arp, but basically i was stuck 
until i forced the cisco to update its ARP cache for each of the secondary 
addrs...

it seems to me it'd be nice for the init sequence to take care of this, so 
that other folks don't have to spend time debugging similar problems.  i 
just wanted to ask if i'm missing something obvious before i go open a 
debian bug.  (i'm tempted to see if fedora does anything differently.)

thanks
-dean
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [rdiff-backup-users] What happens when you move files around?

2006-11-25 Thread dean gaudet
as a one time hack, if you want to reduce the network bandwidth required 
to move the 1GB file you could hardlink the oldfile to the newfile (and 
leave the oldfile around for now) and then do a backup... rdiff-backup 
will detect the hardlink and just hardlink the destination file in the 
mirror without retransmitting the file.

however when you delete the oldfile and do another backup then you'll end 
up with a reverse delta for the 1GB file... rdiff-backup has no way to say 
"create it from this other file over here".

so basically you can just save yourself some network bandwidth, but not 
backup-server disk space.
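
the hardlink trick can be demonstrated with a throwaway file -- this 
sketch doesn't invoke rdiff-backup itself, it just shows that both names 
end up sharing one inode, which is what lets the mirror avoid a second 
copy of the data:

```python
import os
import tempfile

# throwaway demonstration: rdiff-backup itself isn't invoked here
with tempfile.TemporaryDirectory() as d:
    old = os.path.join(d, "oldfile")
    new = os.path.join(d, "newfile")
    with open(old, "wb") as f:
        f.write(b"\0" * 4096)     # stand-in for the 1GB file
    os.link(old, new)             # hardlink: no extra data stored
    print(os.stat(new).st_nlink)  # 2 -- one inode, two names
```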

it really does beg the question... why can't rdiff-backup do this itself?  
in general keeping track of all dev:inode pairs in a backup can consume 
way too much memory (think 10s of millions of inodes on some 
filesystems)... however it would seem possible to look around in the same 
directory for a renamed file (i.e. log file rotation).

i hack around this by rotating my log files by date.

http://arctic.org/~dean/scripts/date-rotate

-dean

On Sat, 25 Nov 2006, roland wrote:

> afaik, rdiff-backup doesn`t detect file moves, so the file will be backed up
> another time and another GB of storage is needed for that.
>
> regards
> roland
>
> - Original Message - From: "Karjala" [EMAIL PROTECTED]
> To: rdiff-backup-users@nongnu.org
> Sent: Saturday, November 25, 2006 7:25 PM
> Subject: [rdiff-backup-users] What happens when you move files around?
>
> > On my system I move files around directories a bit. How are files that have
> > been moved since the last backup treated by rdiff-backup? I.e. if I rename a
> > 1GB file that has already been backed-up, will one extra GB of backup
> > storage be taken up on the next automated backup? Also what would happen to
> > the amount of backup storage required, if I moved a 1GB file to another
> > directory (which will also be backed up on the next backup)?
> >
> > Thanks,
> > - K.
> >
> > ___
> > rdiff-backup-users mailing list at rdiff-backup-users@nongnu.org
> > http://lists.nongnu.org/mailman/listinfo/rdiff-backup-users
> > Wiki URL:
> > http://rdiff-backup.solutionsfirst.com.au/index.php/RdiffBackupWiki
>
> ___
> rdiff-backup-users mailing list at rdiff-backup-users@nongnu.org
> http://lists.nongnu.org/mailman/listinfo/rdiff-backup-users
> Wiki URL: http://rdiff-backup.solutionsfirst.com.au/index.php/RdiffBackupWiki


___
rdiff-backup-users mailing list at rdiff-backup-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/rdiff-backup-users
Wiki URL: http://rdiff-backup.solutionsfirst.com.au/index.php/RdiffBackupWiki


Re: Raid 1 (non) performance

2006-11-19 Thread dean gaudet
On Wed, 15 Nov 2006, Magnus Naeslund(k) wrote:

> # cat /proc/mdstat
> Personalities : [raid1]
> md2 : active raid1 sda3[0] sdb3[1]
>       236725696 blocks [2/2] [UU]
>
> md1 : active raid1 sda2[0] sdb2[1]
>       4192896 blocks [2/2] [UU]
>
> md0 : active raid1 sda1[0] sdb1[1]
>       4192832 blocks [2/2] [UU]

i see you have split /var and / on the same spindle... if your /home is on 
/ then you're causing extra seek action by having two active filesystems 
on the same spindles.  another option to consider is to make / small and 
mostly read-only and move /home to /var/home (and use a symlink or mount 
--bind to place it at /home).

or just put everything in one big / filesystem.
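
the bind-mount variant might look like this in /etc/fstab (paths are 
assumptions, not from the original post):

```
# hypothetical /etc/fstab entry: keep home data on the /var filesystem
# but present it at /home via a bind mount
/var/home   /home   none   bind   0   0
```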

hopefully your swap isn't being used much anyhow.

try "iostat -kx /dev/sd* 5" and see if the split is causing you troubles 
-- i/o activity on more than one partition at once.


> I've tried to modify the queuing by doing this, to disable the write cache
> and enable CFQ. The CFQ choice is rather random.
>
> for disk in sda sdb; do
>   blktool /dev/$disk wcache off
>   hdparm -q -W 0 /dev/$disk

turning off write caching is a recipe for disastrous performance on most 
ata disks... unfortunately.  better to buy a UPS and set up nut or apcupsd 
or something to handle shutdown.  or just take your chances.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: safest way to swap in a new physical disk

2006-11-18 Thread dean gaudet
On Tue, 14 Nov 2006, Will Sheffler wrote:

 Hi.
 
 What is the safest way to switch out a disk in a software raid array created
 with mdadm? I'm not talking about replacing a failed disk, I want to take a
 healthy disk in the array and swap it for another physical disk. Specifically,
 I have an array made up of 10 250gb software-raid partitions on 8 300gb disks
 and 2 250gb disks, plus a hot spare. I want to switch the 250s to new 300gb
 disks so everything matches. Is there a way to do this without risking a
 rebuild? I can't back everything up, so I want to be as risk-free as possible.
 
 I guess what I want is to do something like this:
 
 (1) Unmount the array
 (2) Un-create the array
 (3) Somehow exactly duplicate partition X to a partition Y on a new disk
 (4) Re-create array with X gone and Y in its place
 (5) Check if the array is OK without changing/activating it
 (6) If there is a problem, switch from Y back to X and have it as though
 nothing changed
 
 The part I'm worried about is (3), as I've tried duplicating partition images
 before and it never works right. Is there a way to do this with mdadm?

if you have a recent enough kernel (2.6.15 i think) and recent enough 
mdadm (2.2.x i think) you can do this all online without losing redundancy 
for more than a few seconds... i placed a copy of instructions and further 
discussions of what types of problems this method has here:

http://arctic.org/~dean/proactive-raid5-disk-replacement.txt

it's actually perfect for your situation.

-dean


Bug#355178: [#355178] unable to reproduce the 4GB librsync1 problem

2006-11-18 Thread dean gaudet
On Sat, 18 Nov 2006, Michael Prokop wrote:

 So can you please provide the necessary steps to reproduce the problem?

iirc it doesn't happen on every file over 4GB.

try between a 32-bit and a 64-bit host -- that's when it was hitting me 
the worst.

-dean


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]



Bug#399271: post(8) segfaulting

2006-11-18 Thread dean gaudet
Package: nmh
Version: 1.2-1

in 1.2-1 post(8) is segfaulting (amd64)... doesn't happen with same config 
on 1.1-release-4.

if i get a chance i'll grab a gdb backtrace... but maybe this strace will 
help.

oh maybe my mts.conf will help too:

# grep -ve '^#.*' -e '^$' /etc/nmh/mts.conf
mts: smtp
hostable: /etc/nmh/hosts
masquerade: draft_from
mmdfldir: /var/mail
mmdflfil:
servers: localhost

-dean

12643 execve("/usr/lib/mh/post", ["post", "-library", "/home/dean/Mail", 
"-alias", "aliases", "/home/dean/Mail/drafts/1"], [/* 36 vars */]) = 0
12643 uname({sys="Linux", node="twinlark.arctic.org", ...}) = 0
12643 brk(0)                            = 0x54d000
12643 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
12643 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x2b655ea01000
12643 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
12643 open("/etc/ld.so.cache", O_RDONLY) = 3
12643 fstat(3, {st_mode=S_IFREG|0644, st_size=62339, ...}) = 0
12643 mmap(NULL, 62339, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2b655ea03000
12643 close(3)  = 0
12643 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
12643 open("/usr/lib/libsasl2.so.2", O_RDONLY) = 3
12643 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\0\1\0\0\0\300H\0\0"..., 
640) = 640
12643 fstat(3, {st_mode=S_IFREG|0644, st_size=103096, ...}) = 0
12643 mmap(NULL, 115, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) 
= 0x2b655eb02000
12643 mprotect(0x2b655eb1a000, 1051696, PROT_NONE) = 0
12643 mmap(0x2b655ec1a000, 4096, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x18000) = 0x2b655ec1a000
12643 close(3)  = 0
12643 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
12643 open("/usr/lib/libdb-4.5.so", O_RDONLY) = 3
12643 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\0\1\0\0\0\200Z\2\0"..., 
640) = 640
12643 fstat(3, {st_mode=S_IFREG|0644, st_size=1204488, ...}) = 0
12643 mmap(NULL, 2252248, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) 
= 0x2b655ec1b000
12643 mprotect(0x2b655ed3c000, 1068504, PROT_NONE) = 0
12643 mmap(0x2b655ee3c000, 20480, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x121000) = 0x2b655ee3c000
12643 close(3)  = 0
12643 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
12643 open("/usr/lib/liblockfile.so.1", O_RDONLY) = 3
12643 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\0\1\0\0\0`\23\0\0"..., 640) 
= 640
12643 fstat(3, {st_mode=S_IFREG|0644, st_size=11512, ...}) = 0
12643 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x2b655ee41000
12643 mmap(NULL, 1058144, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) 
= 0x2b655ee42000
12643 mprotect(0x2b655ee45000, 1045856, PROT_NONE) = 0
12643 mmap(0x2b655ef44000, 4096, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x2b655ef44000
12643 close(3)  = 0
12643 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
12643 open("/lib/libc.so.6", O_RDONLY)  = 3
12643 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\0\1\0\0\0\200\305"..., 640) 
= 640
12643 lseek(3, 624, SEEK_SET)   = 624
12643 read(3, "\4\0\0\0\20\0\0\0\1\0\0\0GNU\0\0\0\0\0\2\0\0\0\6\0\0\0"..., 32) 
= 32
12643 fstat(3, {st_mode=S_IFREG|0755, st_size=1286312, ...}) = 0
12643 mmap(NULL, 2344904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) 
= 0x2b655ef45000
12643 mprotect(0x2b655f066000, 1161160, PROT_NONE) = 0
12643 mmap(0x2b655f166000, 98304, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x121000) = 0x2b655f166000
12643 mmap(0x2b655f17e000, 14280, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2b655f17e000
12643 close(3)  = 0
12643 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
12643 open("/lib/libdl.so.2", O_RDONLY) = 3
12643 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\0\1\0\0\0\0\20\0\0"..., 
640) = 640
12643 fstat(3, {st_mode=S_IFREG|0644, st_size=10392, ...}) = 0
12643 mmap(NULL, 1057000, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) 
= 0x2b655f182000
12643 mprotect(0x2b655f184000, 1048808, PROT_NONE) = 0
12643 mmap(0x2b655f283000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x2b655f283000
12643 close(3)  = 0
12643 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
12643 open("/lib/libresolv.so.2", O_RDONLY) = 3
12643 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\0\1\0\0\0\2207\0\0"..., 
640) = 640
12643 fstat(3, {st_mode=S_IFREG|0644, st_size=76600, ...}) = 0
12643 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x2b655f285000
12643 mmap(NULL, 1133320, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) 
= 0x2b655f286000
12643 mprotect(0x2b655f297000, 1063688, PROT_NONE) = 0
12643 mmap(0x2b655f397000, 8192, 

Re: [rdiff-backup-users] Suggestion for documentation change

2006-11-18 Thread dean gaudet
On Sat, 18 Nov 2006, Andrew Ferguson wrote:

 dean gaudet wrote:
  sounds like the bug is that rdiff-backup decides there's a metadata change 
  and stores an almost-empty .diff.gz file even though it's not required.  
  even though the metadata change is innocuous...
 
 I think there could be a reason for the almost-empty .diff.gz file, but
 it depends on whether we view metadata changes as true changes to a
 file. That is, do we truly want to see that a file 'changed' at a
 specific backup time, or do we just want that implicit in the metadata?
 
 This is connected to the fact that you can do restores with rdiff-backup
 of the form:
 
 rdiff-backup
 /backup/rdiff-backup-data/increments/path/to/file.time.diff.gz
 /path/to/file
 
 If we stop creating the .diff.gz files for metadata-only changes, we
 break this behavior for the case when you want to restore on one side or
 the other of a metadata-only change.
 
 Does this make sense?

yep -- but we could store an actual 0-length file instead... so we're not 
wasting an entire disk block on many filesystems.  better to name it 
.nodiff or something else so we can distinguish between an incompletely 
written .diff.gz and a file with no differences.
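a sketch of how that could look (the helper names below are made up; only the .diff.gz/.nodiff suffixes come from the discussion):

```python
import os

# Hypothetical sketch: metadata-only changes get a zero-length ".nodiff"
# marker; real data changes keep the ".diff.gz" increment.  A zero-length
# marker can't be confused with a partially written .diff.gz, and costs
# no data block on most filesystems.

def increment_name(base, timestamp, data_changed):
    suffix = "diff.gz" if data_changed else "nodiff"
    return "%s.%s.%s" % (base, timestamp, suffix)

def write_increment(directory, base, timestamp, data_changed, diff_bytes=b""):
    path = os.path.join(directory, increment_name(base, timestamp, data_changed))
    with open(path, "wb") as f:
        if data_changed:
            f.write(diff_bytes)
        # metadata-only change: leave the marker empty on purpose
    return path
```

restore logic would then treat a .nodiff increment as "copy through unchanged" while still recording that the metadata flipped at that time.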

there's another case of frequently changing metadata -- if you backup from 
an LVM snapshot the device number changes now and then (because it's 
dynamic).  the inode remains the same... this used to cause me a lot of 
disk space wastage but i stopped using LVM and so the problem dropped in 
priority for me.

-dean


___
rdiff-backup-users mailing list at rdiff-backup-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/rdiff-backup-users
Wiki URL: http://rdiff-backup.solutionsfirst.com.au/index.php/RdiffBackupWiki


Re: [rdiff-backup-users] Suggestion for documentation change

2006-11-18 Thread dean gaudet
On Sat, 18 Nov 2006, dean gaudet wrote:

 yep -- but we could store an actual 0-length file instead... so we're not 
 wasting an entire disk block on many filesystems.  better to name it 
 .nodiff or something else so we can distinguish between an incompletely 
 written .diff.gz and a file with no differences.
 
 there's another case of frequently changing metadata -- if you backup from 
 an LVM snapshot the device number changes now and then (because it's 
 dynamic).  the inode remains the same... this used to cause me a lot of 
 disk space wastage but i stopped using LVM and so the problem dropped in 
 priority for me.

actually this is probably going to be more of a problem in the newfangled 
world of dynamically assigned device numbers.

i wonder if there's any sane API we can use to get the UUID of a 
filesystem.
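for what it's worth, on linux the closest thing to an API is udev's /dev/disk/by-uuid symlink farm; here's a sketch (the directory layout is udev's convention, not a kernel interface, so this is an assumption rather than a guaranteed method):

```python
import os

def device_uuids(by_uuid_dir="/dev/disk/by-uuid"):
    """Map resolved device path -> filesystem UUID by inverting the
    udev-maintained symlinks.  Returns {} if the directory is absent."""
    mapping = {}
    if not os.path.isdir(by_uuid_dir):
        return mapping
    for uuid in os.listdir(by_uuid_dir):
        link = os.path.join(by_uuid_dir, uuid)
        # each entry is a symlink named after the UUID, pointing at the device
        mapping[os.path.realpath(link)] = uuid
    return mapping
```

keying backup metadata on the UUID instead of st_dev would survive the device number changing between snapshots.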

-dean




Re: [rdiff-backup-users] [PATCH] Log symlink creation

2006-11-18 Thread dean gaudet
On Thu, 16 Nov 2006, Gordon Rowell wrote:

 Has anyone looked at storing symlink metadata separately for filesystems
 which don't support symlinks? In particular, when backing up to a CIFS
 fileystem. This currently logs a SpecialFileError, which is not surprising.

yeah it would be desirable to store the symlink info in the metadata.  
and desirable to add an fs_abilities test for symlink capabilities...

i wasn't sure about your patch -- it just adds more logging?

-dean




RE: touch_cache() only touches two thirds

2006-11-17 Thread dean gaudet
On Fri, 17 Nov 2006, dean gaudet wrote:

> another pointer chase arranged to fill the L1 (or L2) using many many 
> pages.  i.e. suppose i wanted to traverse 32KiB L1 with 64B cache lines 
> then i'd allocate 512 pages and put one line on each page (pages ordered 
> randomly), but colour them so they fill the L1.  this conveniently happens 
> to fit in a 2MiB huge page on x86, so you could even ameliorate the TLB 
> pressure from the microbenchmark.

btw, for L2-sized measurements you don't need to be so clever... you can 
get away with a random traversal of the L2 on 128B boundaries.  (need to 
avoid the "next-line prefetch" issues on p-m/core/core2, p4 model 3 and 
later.)  there's just so many more pages required to map the L2 than any 
reasonable prefetcher is going to have any time soon.

-dean


> the benchmark i was considering would be like so:
> 
>   switch to cpu m
>   scrub the cache
>   switch to cpu n
>   scrub the cache
>   traverse the coloured list and modify each cache line as we go
>   switch to cpu m
>   start timing
>   traverse the coloured list without modification
>   stop timing
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: touch_cache() only touches two thirds

2006-11-17 Thread dean gaudet
On Fri, 10 Nov 2006, Bela Lubkin wrote:

> The corrected code in http://bugzilla.kernel.org/show_bug.cgi?id=7476#c4
> covers the full cache range.  Granted that modern CPUs may be able to track
> multiple simultaneous cache access streams: how many such streams are they
> likely to be able to follow at once?  It seems like going from 1 to 2 would
> be a big win, 2 to 3 a small win, beyond that it wouldn't likely make much
> incremental difference.  So what do the actual implementations in the field
> support?

p-m family, core, core2 track one stream on each of 12 to 16 pages.  in 
the earlier ones they split the trackers into some forward-only and some 
backward-only, but on core2 i think they're all bidirectional.  if i had 
to guess they round-robin the trackers, so once you hit 17 pages with 
streams they're defeated.

a p4 (0f0403, probably "prescott") i have here is tracking 16 -- seems to 
use LRU or pLRU but i haven't tested really, you need to get out past 32 
streams before it really starts falling off... and even then the next-line 
prefetch in the L2 helps too much (64-byte lines, but 128-byte tags and a 
pair of dirty/state bits -- it prefetches the other half of a pair 
automatically).  oh it can track forward or backward, and is happy with 
strides up to 128.

k8 rev F tracks one stream on each of 20 pages (forward or backward).  it 
also seems to use round-robin, and is defeated as soon as you have 21 
streams.

i swear there was an x86 which did 28 streams, but it was a few years ago 
that i last looked really closely at the prefetchers and i don't have 
access to the data at the moment.

i suggest that streams are the wrong approach.  i was actually considering 
this same problem this week, happy to see your thread.

the approach i was considering was to set up two pointer chases:

one pointer chase covering enough cache lines (and in a prefetchable 
ordering) for "scrubbing" the cache(s).

another pointer chase arranged to fill the L1 (or L2) using many many 
pages.  i.e. suppose i wanted to traverse 32KiB L1 with 64B cache lines 
then i'd allocate 512 pages and put one line on each page (pages ordered 
randomly), but colour them so they fill the L1.  this conveniently happens 
to fit in a 2MiB huge page on x86, so you could even ameliorate the TLB 
pressure from the microbenchmark.

you can actually get away with a pointer every 256 bytes today -- none of 
the prefetchers on today's x86 cores consider a 256 byte stride to be 
prefetchable.  for safety you might want to use 512 byte alignment... this 
lets you get away with fewer pages for colouring larger caches.

the benchmark i was considering would be like so:

switch to cpu m
scrub the cache
switch to cpu n
scrub the cache
traverse the coloured list and modify each cache line as we go
switch to cpu m
start timing
traverse the coloured list without modification
stop timing
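the coloured-chase construction is easy to prototype; here's a plain-python sketch of building and checking one (it only models the pointer graph -- one line per page, a single cycle in randomized page order -- not the cache colouring or the timing):

```python
import random

LINES = 512  # 32 KiB L1 / 64 B lines -> 512 lines, one per page

def build_chase(lines=LINES, seed=0):
    """Build a single-cycle pointer chase visiting each line exactly once,
    in randomized page order (no stride for a prefetcher to latch onto)."""
    order = list(range(lines))
    random.Random(seed).shuffle(order)
    chase = [0] * lines
    for a, b in zip(order, order[1:] + order[:1]):
        chase[a] = b  # line a points at line b; wraps to close the cycle
    return chase, order[0]

def traverse(chase, start):
    visited = []
    p = start
    for _ in range(len(chase)):
        visited.append(p)
        p = chase[p]
    return visited

chase, start = build_chase()
visited = traverse(chase, start)
print(len(set(visited)) == LINES)  # True: every line touched exactly once
```

in the real benchmark each list element would live on its own page at an offset chosen for its cache colour, and the traversal loop would be the timed region.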

-dean


Re: How to interpret MCE messages?

2006-11-17 Thread dean gaudet
On Wed, 15 Nov 2006, martin f krafft wrote:

> Thus I guess the CPU is asking for retirement. I am just
> double-checking with you guys whether I can be sure that it's only
> the CPU, or whether it could also be the fault of the motherboard...

could be VRMs and/or PSU delivering unclean power... but you'd probably 
see other errors in that case too.

-dean


Re: [rdiff-backup-users] Suggestion for documentation change

2006-11-17 Thread dean gaudet
sounds like the bug is that rdiff-backup decides there's a metadata change 
and stores an almost-empty .diff.gz file even though it's not required.  
even though the metadata change is innocuous...

seems like it would be possible to at least avoid the almost-empty patch 
file when there is a metadata-only change.

in fact the problem happens for other metadata-only changes... such as 
chown/chmod.

worth fixing if someone has the time :)

easier to fix the almost-empty .diff.gz file problem than it is to figure 
out if a filesystem lacks persistent inode information.

-dean

On Wed, 15 Nov 2006, Michael Stucki wrote:

 Hi folks,
 
 I just managed to work around a big problem, and I thought it would be good
 to let you know about:
 
 I was trying to backup a collection of mounted SMB shares on a Linux system.
 In the first run, this worked very well, but after that I've encountered a
 big problem:
 
 Although nothing changed on any of those shares, rdiff-backup incremented a
 lot of files, just randomly as it seemed to me. The diff files of these
 increments were always just a few bytes, they didn't contain any noticeable
 changes inside.
 
 After all I've found out that the --no-compare-inode option solves this
 problem. As it seems, remounting a share causes a change of the inode data
 (though there is nothing stored on inodes since the system is remote...)
 
 Since the man-page didn't mention anything about this, and I didn't see
 anybody else reporting this problem, I thought it would be good to let you
 know about this little workaround...
 
 - michael
 
 
 




Re: [rdiff-backup-users] [PATCH] rdiff-backup.spec cleanups

2006-11-17 Thread dean gaudet
commited, thanks.

-dean

On Thu, 16 Nov 2006, Gordon Rowell wrote:

 - Adjust URLs
 - Add changelog entries
 
 It's a pity there is an Epoch: 0 header in the SPEC files as an Epoch of zero
 outvotes no Epoch. I'd be tempted to delete those headers and let RPM
 versioning work normally. However, now that they are there, we're sort of
 stuck with them :-(
 
 Thanks,
 
 Gordon
 -- 
 Gordon Rowell   Gormand Pty Ltdhttp://www.gormand.com.au/
 SME Server development and support http://www.smeserver.com.au/
 Perl development, systems and network consulting
 
 
 




Re: raid5 hang on get_active_stripe

2006-11-15 Thread dean gaudet
and i haven't seen it either... neil do you think your latest patch was 
hiding the bug?  'cause there was an iteration of an earlier patch which 
didn't produce much spam in dmesg but the bug was still there, then there 
is the version below which spams dmesg a fair amount but i didn't see the 
bug in ~30 days.

btw i've upgraded that box to 2.6.18.2 without the patch (it had some 
conflicts)... haven't seen the bug yet though (~10 days so far).

hmm i wonder if i could reproduce it more rapidly if i lowered 
/sys/block/mdX/md/stripe_cache_size.  i'll give that a go.

-dean


On Tue, 14 Nov 2006, Chris Allen wrote:

 You probably guessed that no matter what I did, I never, ever saw the problem
 when your
 trace was installed. I'd guess at some obscure timing-related problem. I can
 still trigger it
 consistently with a vanilla 2.6.17_SMP though, but again only when bitmaps are
 turned on.
 
 
 
 Neil Brown wrote:
  On Tuesday October 10, [EMAIL PROTECTED] wrote:

   Very happy to. Let me know what you'd like me to do.
   
  
  Cool thanks.
  
  At the end is a patch against 2.6.17.11, though it should apply against
  any later 2.6.17 kernel.
  Apply this and reboot.
  
  Then run
  
 while true
 do cat /sys/block/mdX/md/stripe_cache_active
    sleep 10
 done > /dev/null
  
  (maybe write a little script or whatever).  Leave this running.  It
  affects the check for "has raid5 hung".  Make sure to change mdX to
  whatever is appropriate.
  
  Occasionally look in the kernel logs for
 plug problem:
  
  if you find that, send me the surrounding text - there should be about
  a dozen lines following this one.
  
  Hopefully this will let me know which is last thing to happen: a plug
  or an unplug.
  If the last is a plug, then the timer really should still be
  pending, but isn't (this is impossible).  So I'll look more closely at
  that option.
  If the last is an unplug, then the 'Plugged' flag should really be
  clear but it isn't (this is impossible).  So I'll look more closely at
  that option.
  
  Dean is running this, but he only gets the hang every couple of
  weeks.  If you get it more often, that would help me a lot.
  
  Thanks,
  NeilBrown
  
  
  diff ./.patches/orig/block/ll_rw_blk.c ./block/ll_rw_blk.c
  --- ./.patches/orig/block/ll_rw_blk.c	2006-08-21 09:52:46.000000000 +1000
  +++ ./block/ll_rw_blk.c	2006-10-05 11:33:32.000000000 +1000
  @@ -1546,6 +1546,7 @@ static int ll_merge_requests_fn(request_
    * This is called with interrupts off and no requests on the queue and
    * with the queue lock held.
    */
  +static atomic_t seq = ATOMIC_INIT(0);
   void blk_plug_device(request_queue_t *q)
   {
   	WARN_ON(!irqs_disabled());
  @@ -1558,9 +1559,16 @@ void blk_plug_device(request_queue_t *q)
   		return;
   	if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
  +		q->last_plug = jiffies;
  +		q->plug_seq = atomic_read(&seq);
  +		atomic_inc(&seq);
   		mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
   		blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
  -	}
  +	} else
  +		q->last_plug_skip = jiffies;
  +	if (!timer_pending(&q->unplug_timer) &&
  +	    !q->unplug_work.pending)
  +		printk("Neither Timer or work are pending\n");
   }
   EXPORT_SYMBOL(blk_plug_device);
  @@ -1573,10 +1581,17 @@ int blk_remove_plug(request_queue_t *q)
   {
   	WARN_ON(!irqs_disabled());
  -	if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
  +	if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
  +		q->last_unplug_skip = jiffies;
   		return 0;
  +	}
   	del_timer(&q->unplug_timer);
  +	q->last_unplug = jiffies;
  +	q->unplug_seq = atomic_read(&seq);
  +	atomic_inc(&seq);
  +	if (test_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
  +		printk("queue still (or again) plugged\n");
   	return 1;
   }
  @@ -1635,7 +1650,7 @@ static void blk_backing_dev_unplug(struc
   static void blk_unplug_work(void *data)
   {
   	request_queue_t *q = data;
  -
  +	q->last_unplug_work = jiffies;
   	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
   			q->rq.count[READ] + q->rq.count[WRITE]);
  @@ -1649,6 +1664,7 @@ static void blk_unplug_timeout(unsigned
   	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
   			q->rq.count[READ] + q->rq.count[WRITE]);
  +	q->last_unplug_timeout = jiffies;
   	kblockd_schedule_work(&q->unplug_work);
   }
  
  diff ./.patches/orig/drivers/md/raid1.c ./drivers/md/raid1.c
  --- ./.patches/orig/drivers/md/raid1.c	2006-08-10 17:28:01.000000000 +1000
  +++ ./drivers/md/raid1.c	2006-09-04 21:58:31.000000000 +1000
  @@ -1486,7 +1486,6 @@ static void raid1d(mddev_t *mddev)
   		d = conf->raid_disks;
   		d--;
   		rdev = 

Re: [rdiff-backup-users] --check-destination-dir fails after a crash (redux)

2006-11-15 Thread dean gaudet
as i mentioned at some other point when this was asked... i'm really not 
100% certain that change is the right change... that's why i never applied 
it to cvs.

maybe someone could look at it more closely?

-dean

On Wed, 15 Nov 2006, Gordon Rowell wrote:

 Gordon Rowell wrote:
 
 Bumping this thread again for Dean's consideration. I have applied this
 patch to the 1.1.7 RPM I made as it does solve one of the crash
 regression issues for me.
 
 Dean - comments?
 
 Gordon
 
  Hi everyone,
  
  I experienced the error shown here:
  
 
  http://lists.nongnu.org/archive/html/rdiff-backup-users/2006-03/msg00038.html
   
  
  and applied the patch show here:
  
 
  http://lists.nongnu.org/archive/html/rdiff-backup-users/2006-03/msg00039.html
   
  
  which worked for me, but I haven't seen any further follow-up and the patch
  doesn't seem to have been applied in CVS.
  
  Is it the correct fix?
  
  Thanks,
  
  Gordon
 
 
 
 
 




Bug#398312: INITRDSTART='none' doesn't work

2006-11-13 Thread dean gaudet
On Mon, 13 Nov 2006, martin f krafft wrote:

 severity 398312 important
 tags 398312 unreproducible moreinfo
 thanks
 
  even though i have INITRDSTART='none' in my /etc/default/mdadm and rebuilt 
 the initrd, it still goes and does array discovery at boot time.
 
 piper:/tmp/cdt.d.Ns8889# grep '^INITRD' /etc/default/mdadm
 INITRDSTART='none'
 piper:/tmp/cdt.d.Ns8889# update-initramfs -c -b . -k $(uname -r)
 update-initramfs: Generating ./initrd.img-2.6.18-2-amd64
 W: mdadm: /etc/mdadm/mdadm.conf defines no arrays.
 I: mdadm: using configuration file: /etc/mdadm/mdadm.conf
 I: mdadm: no MD arrays will be started from the initial ramdisk.
 I: mdadm: use `dpkg-reconfigure --priority=low mdadm` to change this.

try again with VERBOSE=false...

which causes the is_true() in info() to return 1 which causes the set -e 
to terminate the script.

perhaps this bug report should be titled set -e considered harmful :)

set -e would be better if it actually caused sh to complain when the
error occurred... instead it's the worst of both worlds:  scripts exit
due to programming errors and then you have no idea they exited early.
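the failure is reproducible outside the initramfs entirely; here's a sketch driving /bin/sh from python (the script is a made-up stand-in for the hook, assuming a POSIX sh at /bin/sh):

```python
import subprocess

# Under `set -e`, info() ends with a failed `&&` list when VERBOSE is
# false, so info() returns 1 -- and the *call* to info() is then a
# failed simple command that silently kills the whole script.
script = r'''
set -e
is_true() { case "${1:-}" in [Yy]es|[Yy]|1|[Tt]rue|[Tt]) return 0;; *) return 1;; esac; }
write() { echo "$@"; }
info() { is_true "${VERBOSE:-false}" && write I "$@"; }
VERBOSE=false
info "no MD arrays will be started"
echo "reached the end"
'''
result = subprocess.run(["sh", "-c", script], capture_output=True, text=True)
print(result.returncode)   # 1 -- exited at the info call
print(repr(result.stdout)) # '' -- "reached the end" never ran
```

the && failure inside info() is exempt from set -e, but the function's nonzero return status is not, which is exactly why the exit point is so hard to spot.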


  this is marked grave because it can cause dataloss if drives with
  stale superblocks are put together in an unexpected manner
  resulting in an array rebuild.  (i.e. same reasoning as #398310)
 
 Again, I don't see this as a grave bug but human error. I agree that
 mdadm should do something against it, but it's not a grave problem
 every time that it fails to prevent human error.

i dunno, it's not really a human error to not know anything at all about 
the superblocks.

with the default settings of INITRDSTART='all' it's unsafe for a
person to stick some old PATA disks (which happen to be part of an old
array) into their box: adding them requires a reboot, and then the
initrd will assemble an array and md might start trying to rebuild the
disks.  i can't even stick them in to do --zero-superblock...

unless i change INITRDSTART setting and rebuild initrd.

is it really that hard to start only the root array?  i suppose it is a
challenge on an upgrade... because you don't have the helpful sysfs
features that newer 2.6.x kernels provide for finding dependencies.
blah.

-dean

--- /var/tmp/mdadm.orig 2006-11-13 01:28:46.0 -0800
+++ /usr/share/initramfs-tools/hooks/mdadm  2006-11-13 01:51:37.0 -0800
@@ -23,14 +23,6 @@
 ;;
 esac
 
-is_true()
-{
-  case ${1:-} in
-    [Yy]es|[Yy]|1|[Tt]rue|[Tt]) return 0;;
-    *) return 1;;
-  esac
-}
-
 write()
 {
   local PREFIX; PREFIX=$1; shift
@@ -39,7 +31,9 @@
 
 info()
 {
-  is_true ${VERBOSE:-false} && write I "$@"
+  if [ "$VERBOSE" = true ]; then
+   write I "$@"
+  fi
 }
 
 warn()


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#398312: Re: Bug#398312: INITRDSTART='none' doesn't work

2006-11-13 Thread dean gaudet
On Mon, 13 Nov 2006, martin f krafft wrote:

 also sprach dean gaudet [EMAIL PROTECTED] [2006.11.13.1107 +0100]:
  which causes the is_true() in info() to return 1 which causes the set -e 
  to terminate the script.
 
 What shell are you using?

my SHELL=/bin/zsh, but that won't affect the script... the script is 
#!/bin/sh ... and /bin/sh -> bash.

-dean





Bug#398310: Re: Bug#398310: don't assemble all arrays on install

2006-11-13 Thread dean gaudet
On Mon, 13 Nov 2006, martin f krafft wrote:

 also sprach dean gaudet [EMAIL PROTECTED] [2006.11.13.1116 +0100]:
  right, now i know that i should create an /etc/default/mdadm
  *before* i install mdadm... because unlike other packages, mdadm
  does potentially dangerous things just by installing it.  i'll
  keep that in mind :)
 
 You could also just reconfigure your debconf to show questions of
 low priority; since you're juggling disks, it seems like that's the
 more appropriate level.
 
 I have raised the question for INITRDSTART to high priority.

thanks!


  i think the only solution is to go entirely event based... start
  meshing into udev or something.  you'd have to be able to express
  the dependencies of a device/filesystem somehow though.  ugh.
 
 we have plans in this direction.

yeah... i've been meaning to ramp up on them.


  actually, after playing with the disks with md, and then moving
  them into 3ware hardware raid, i did zero the disks... through the
  3ware hw raid.  the problem is that the 3ware hw raid superblock
  is even larger than the md raid superblock (100MB vs. a few MB in
  my limited experiments)... so even though i zeroed the hw raid
  device it went nowhere near the stale md superblock (even the
  3ware hw raid superblock never touched it).
 
 they are likely at opposite ends of the disk.

the 3ware superblock is at the end of the disk similar to mdadm... i 
actually successfully pulled the disks from a 3ware raid10 and constructed 
a md raid0 with two of the disks with mdadm --build (and recovered the 
data which the 3ware had decided to lose)... and i didn't have to do 
anything complex other than figure out which two disks to use.

i've done a similar experiment with a 3ware raid1 disk... i could mount it 
just fine on a non-3ware device.

-dean





Bug#398347: hooks should respect run-parts naming conventions

2006-11-13 Thread dean gaudet
Package: initramfs-tools
Version: 0.85a

the run_scripts() function should respect the same naming conventions as 
run-parts(8) ... in particular if my editor creates foo~, foo.bak, 
.foo.swp files run_scripts() will try to run them.  ditto for foo,v.

run_scripts() should also not attempt to invoke directories (i.e. RCS, 
CVS).

in particular i've run into troubles in the 
/usr/share/initramfs-tools/hooks directory... while trying to fix bugs in 
a hook i did cp foo{,.orig} so that i would be able to create a patch 
for upstream and ran into troubles because the foo.orig was executed as 
well.

i imagine there'll be .dpkg-old / .dpkg-dist / etc. troubles for hooks 
which are conffiles (which i'd hope most hooks would be).
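as a sketch, the filtering being asked for looks something like this (the helper name and suffix list are illustrative, not the actual initramfs-tools code; run-parts(8) has its own canonical rules):

```shell
# decide whether a hook file should be executed, skipping editor
# backups, RCS/dpkg artifacts, and directories, roughly in the
# spirit of run-parts(8)
is_runnable_part()
{
  name=$(basename "$1")
  case "$name" in
    *~|*.bak|.*.swp|*,v|*.orig|*.dpkg-old|*.dpkg-dist) return 1 ;;
  esac
  # directories such as RCS or CVS fail the -f test
  [ -f "$1" ] && [ -x "$1" ]
}
```

run_scripts() could apply such a check before invoking each entry, so a stray cp foo{,.orig} no longer gets executed twice.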

-dean





Bug#398310: don't assemble all arrays on install

2006-11-13 Thread dean gaudet
On Mon, 13 Nov 2006, martin f krafft wrote:

 severity 398310 important
 retitle 398310 let user choose when to start which array
 tags 398310 confirmed help
 thanks
 
 also sprach dean gaudet [EMAIL PROTECTED] [2006.11.13.0230 +0100]:
  i had 4 disks which i had experimented with sw raid10 on a few months 
  back... i never zeroed the superblocks.  i ended up putting them into 
  production in a 3ware hw raid10.  today the 3ware freaked out... and i put 
  the disks into another box to attempt forensics and to try constructing 
  *read-only* software arrays to see if i could recover the data.
  
  when i did apt-get install mdadm it found the old superblocks from my 
  experiments a few months ago... and tried to start the array!
 
 You can set AUTOSTART=false in /etc/default/mdadm or via debconf,
 and no arrays will be started.

right, now i know that i should create an /etc/default/mdadm *before* i 
install mdadm... because unlike other packages, mdadm does potentially 
dangerous things just by installing it.  i'll keep that in mind :)
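for the record, this is the sort of /etc/default/mdadm fragment one would have to drop in place beforehand (values taken from this thread; whether the postinst honors a pre-existing file is exactly the point in question):

```shell
# /etc/default/mdadm -- created *before* "apt-get install mdadm" so
# the postinst does not assemble every array it discovers
AUTOSTART=false
INITRDSTART='none'
```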


 I do like the idea of selecting which arrays to start when.
 Ideally, for each array, you'd select whether to start it from
 initramfs, from init.d at boot, from init.d at install time, or from
 init.d run manually. You can distinguish between the latter three
 using the runlevel and a custom variable passed from postinst.

it gets worse when you start considering external bitmaps... i posted to 
linux-raid about the dependency problems here.  you can't autostart an 
array with external bitmap until the bitmap is available... and if the 
bitmap is on a filesystem which is on another md device (think many disk 
raid5 external bitmap on raid1 root disks) then you need some md devices 
to start, some filesystems to be mounted, and then some more md devices to 
start and more filesystems to be mounted.

i think the only solution is to go entirely event based... start meshing 
into udev or something.  you'd have to be able to express the dependencies 
of a device/filesystem somehow though.  ugh.


 In any case, I don't consider the bug you filed to be grave because
 you forgot to zero the superblocks.

actually, after playing with the disks with md, and then moving them into 
3ware hardware raid, i did zero the disks... through the 3ware hw raid.  
the problem is that the 3ware hw raid superblock is even larger than the 
md raid superblock (100MB vs. a few MB in my limited experiments)... so 
even though i zeroed the hw raid device it went nowhere near the stale md 
superblock (even the 3ware hw raid superblock never touched it).

it took me a while to figure out that this was what happened -- at first 
i thought mdadm had somehow read 3ware superblocks... there had been talk 
of an industry standard but i was skeptical it ever went anywhere.

-dean







Bug#398310: don't assemble all arrays on install

2006-11-12 Thread dean gaudet
Package: mdadm
Version: 2.5.5-1
Severity: grave

it's dangerous to generate an mdadm.conf and start running arrays 
automatically at install time!  i nearly got bit by this.

i marked this grave because there's a potential for data loss with the 
current install scripts.

i had 4 disks which i had experimented with sw raid10 on a few months 
back... i never zeroed the superblocks.  i ended up putting them into 
production in a 3ware hw raid10.  today the 3ware freaked out... and i put 
the disks into another box to attempt forensics and to try constructing 
*read-only* software arrays to see if i could recover the data.

when i did apt-get install mdadm it found the old superblocks from my 
experiments a few months ago... and tried to start the array!

fortunately i had issued "blockdev --setro /dev/sd[defg]" prior to doing 
any of this, so the block layer saved my ass.

otherwise mdadm would have happily screwed around with the data at the end 
of the disks... and *even worse* might have decided recovery was necessary 
and really screwed things up!

it's *bad* to autostart all discovered arrays.  it's unfortunate enough 
that you've decided to make initrds start all arrays by default... but at 
least this install-time autodiscover and start everything should be 
optional.

at a minimum i think there should be a dialog "attempt to autodiscover all 
arrays and start them?".  even better would be a second step "i found the 
following arrays, which ones should i start?"

-dean

p.s. regardless of this complaint, i'm totally happy with the newer 
initramfs which handles renames more gracefully... and with the monthly 
checkarray default.  thanks!





Bug#398312: INITRDSTART='none' doesn't work

2006-11-12 Thread dean gaudet
Package: mdadm
Version: 2.5.5-1
Severity: grave

even though i have INITRDSTART='none' in my /etc/default/mdadm and rebuilt 
the initrd, it still goes and does array discovery at boot time.

this is marked grave because it can cause data loss if drives with stale 
superblocks are put together in an unexpected manner resulting in an array 
rebuild.  (i.e. same reasoning as #398310)

here's my current setup:

# grep -ve '^#' -e '^ *$' /etc/mdadm/mdadm.conf
DEVICE partitions
CREATE owner=root group=disk mode=0660 auto=yes
HOMEHOST <system>
MAILADDR root
# grep -ve '^#' -e '^ *$' /etc/default/mdadm
INITRDSTART='none'
AUTOSTART=false
AUTOCHECK=false
START_DAEMON=false
VERBOSE=false
USE_DEPRECATED_MDRUN=false

notice i have no arrays defined.

# dpkg-reconfigure linux-image-`uname -r`
Running depmod.
Finding valid ramdisk creators.
Using mkinitramfs-kpkg to build the ramdisk.
W: mdadm: /etc/mdadm/mdadm.conf defines no arrays.
Not updating initrd symbolic links since we are being updated/reinstalled
(twinlark.1 was configured last, according to dpkg)
Not updating image symbolic links since we are being updated/reinstalled
(twinlark.1 was configured last, according to dpkg)
Running postinst hook script /sbin/update-grub.
Searching for GRUB installation directory ... found: /boot/grub
Testing for an existing GRUB menu.lst file ... found: /boot/grub/menu.lst
Searching for splash image ... none found, skipping ...
Found kernel: /boot/vmlinuz-2.6.17.11
Found kernel: /boot/vmlinuz-2.6.17-2-amd64
Found kernel: /boot/vmlinuz-2.6.16-2-amd64-generic
Found kernel: /boot/memtest86+.bin
Updating /boot/grub/menu.lst ... done

notice it complains that i have no arrays defined.

# mkdir /tmp/initrd
# cd /tmp/initrd
# zcat /boot/initrd.img-`uname -r` | cpio -i
26975 blocks

ok now i look at scripts/local-top/mdadm ... i note it sets MD_DEVS=all,
which presumably should be overridden by /conf/md.conf... yet conf/md.conf
contains:

# cat conf/md.conf
MD_HOMEHOST='groove242'

missing MD_DEVS=none.

also, scripts/local-top/mdadm goes on to test there's an
/etc/mdadm/mdadm.conf, which isn't present in the initrd.  because
etc/mdadm/mdadm.conf isn't there, scripts/local-top/mdadm goes on to
autodiscover all arrays... and then because of the missing MD_DEVS=none
it assembles them all.

as mentioned, this can result in data loss.
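a hedged way to check for this failure mode after rebuilding an initrd (the helper is illustrative; the simulated md.conf contents mirror the listing above):

```shell
# succeeds when an extracted initrd's conf/md.conf carries an MD_DEVS
# setting; failure means the initrd will fall back to autodiscovering
# and assembling all arrays -- the bug reported here
md_devs_configured()
{
  grep -q '^MD_DEVS=' "$1/conf/md.conf" 2>/dev/null
}

# simulate the buggy conf/md.conf shown above
demo=$(mktemp -d)
mkdir -p "$demo/conf"
printf "MD_HOMEHOST='groove242'\n" > "$demo/conf/md.conf"
md_devs_configured "$demo" || echo "MD_DEVS missing from initrd"
```

in real use one would point the check at the directory produced by the zcat | cpio -i extraction shown earlier.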

while i think the root of the problem is that MD_DEVS=none wasn't copied
from /etc/default/mdadm settings... i think this habit of discovering
and starting all arrays is a bad one.  if i built my initrd without an
mdadm.conf i don't see why you would create one... maybe if you asked first
unable to find root device, should i try to autodiscover and start arrays?
or required an option on the kernel command line...

anyhow, now to go see if this didn't ruin the drives i'm trying to recover
(see #398310).

-dean





[rdiff-backup-users] rdiff-backup 1.1.7 released

2006-11-12 Thread dean gaudet
fixes the OSX showstopper in 1.1.6.

http://savannah.nongnu.org/download/rdiff-backup/rdiff-backup-1.1.7.tar.gz

i have to admit, i haven't tested these releases... but i'm still of the 
opinion it's better for me to commit / release than it is to just let 
things stagnate :)

-dean

New in v1.1.7 (2006/11/12)
--

Fix showstopper problem on OSX handling pre-1.1.6 rdiff-backup metadata.
(Patch from Andrew Ferguson.)



___
rdiff-backup-users mailing list at rdiff-backup-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/rdiff-backup-users
Wiki URL: http://rdiff-backup.solutionsfirst.com.au/index.php/RdiffBackupWiki

