Re: [PATCH] msleep() with hrtimers

2007-08-09 Thread Denis Vlasenko
On 8/9/07, Denis Vlasenko <[EMAIL PROTECTED]> wrote:
> On 8/8/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> > You keep claiming that hrtimers are so incredibly expensive; but for
> > msleep()... which is mostly called during driver init ... I really don't
> > buy that it's really expensive. We're not doing this a gazilion times
> > per second obviously...
>
> Yes. Optimizing delay or sleep functions for speed is a contradiction
> in terms. IIRC we still optimize udelay for speed, not code size...
> Read it again folks:
>
> We optimize udelay for speed
>
> How fast do you want your udelay to be today?


Just checked. i386 and x86-64 seem to be sane - udelay() and ndelay()
are not inlined.

Several arches still frantically try to make udelay faster. Many carry
the same comment:

/*
 * Use only for very small delays ( < 1 msec).  Should probably use a
 * lookup table, really, as the multiplications take much too long with
 * short delays.  This is a "reasonable" implementation, though (and the
 * first constant multiplications gets optimized away if the delay is
 * a constant)
 */

and thus seem to be cut-and-paste code.

BTW, almost all arches have __const_udelay(N), which obviously
does not delay for N usecs:

#define udelay(n) (__builtin_constant_p(n) ? \
	((n) > 20000 ? __bad_udelay() : __const_udelay((n) * 0x10c7ul)) : \
	__udelay(n))

Bad name.
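
For the curious, 0x10c7 = 4295 is 2^32 / 10^6 rounded up, so the macro's
argument is really a 32.32 fixed-point "xloops" value, not microseconds -
hence the complaint about the name. A minimal userspace sketch of the trick
(illustration only, not the kernel implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Illustration only (not the kernel code): 0x10c7 = 4295 is
 * 2^32 / 10^6 rounded up, so usecs * 0x10c7 is a 32.32 fixed-point
 * "xloops" fraction of a second.  Scaling by the loop rate and keeping
 * the high 32 bits converts usecs to loop iterations with no run-time
 * division. */
static uint32_t xloops_from_usecs(uint32_t usecs)
{
	return usecs * 0x10c7u;		/* usecs * 2^32 / 10^6 */
}

static uint32_t loops_from_xloops(uint32_t xloops, uint32_t loops_per_sec)
{
	return (uint64_t)xloops * loops_per_sec >> 32;
}
```

Because the constant is rounded up, the computed loop count slightly
overshoots the requested delay, which is the safe direction for a delay
primitive.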

Would patches which de-inline udelay and do s/__const_udelay/__const_delay/g
be accepted?

Arches with udelay still inlined are listed below. mips is especially big.
frv has totally bogus ndelay().

include/asm-ppc/delay.h
extern __inline__ void __udelay(unsigned int x)
{
	unsigned int loops;
	__asm__("mulhwu %0,%1,%2" : "=r" (loops) :
		"r" (x), "r" (loops_per_jiffy * 226));
	__delay(loops);
}


include/asm-parisc/delay.h
static __inline__ void __udelay(unsigned long usecs) {
	__cr16_delay(usecs * ((unsigned long)boot_cpu_data.cpu_hz / 1000000UL));
}


include/asm-mips/delay.h
static inline void __udelay(unsigned long usecs, unsigned long lpj)
{
	unsigned long lo;

	/*
	 * The rates of 128 is rounded wrongly by the catchall case
	 * for 64-bit.  Excessive precission?  Probably ...
	 */
#if defined(CONFIG_64BIT) && (HZ == 128)
	usecs *= 0x0008637bd05af6c7UL;		/* 2**64 / (1000000 / HZ) */
#elif defined(CONFIG_64BIT)
	usecs *= (0x8000000000000000UL / (500000 / HZ));
#else /* 32-bit junk follows here */
	usecs *= (unsigned long) (((0x8000000000000000ULL / (500000 / HZ)) +
				   0x80000000ULL) >> 32);
#endif

	if (sizeof(long) == 4)
		__asm__("multu\t%2, %3"
		: "=h" (usecs), "=l" (lo)
		: "r" (usecs), "r" (lpj)
		: GCC_REG_ACCUM);
	else if (sizeof(long) == 8)
		__asm__("dmultu\t%2, %3"
		: "=h" (usecs), "=l" (lo)
		: "r" (usecs), "r" (lpj)
		: GCC_REG_ACCUM);

	__delay(usecs);
}


include/asm-m68k/delay.h
static inline void __udelay(unsigned long usecs)
{
	__const_udelay(usecs * 4295);	/* 2**32 / 1000000 */
}


include/asm-h8300/delay.h
static inline void udelay(unsigned long usecs)
{
	usecs *= 4295;		/* 2**32 / 1000000 */
	usecs /= (loops_per_jiffy*HZ);
	if (usecs)
		__delay(usecs);
}


include/asm-frv/delay.h
static inline void udelay(unsigned long usecs)
{
	__delay(usecs * __delay_loops_MHz);
}
#define ndelay(n)   udelay((n) * 5)
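
That ndelay() waits roughly 5000 times longer than requested for every n.
A non-bogus mapping would round nanoseconds up to whole microseconds before
delegating to udelay() - a sketch of the idea (not the actual frv fix):

```c
#include <assert.h>

/* Sketch only, not the actual frv fix: nanoseconds should be rounded
 * *up* to whole microseconds and handed to udelay(), so that short
 * delays never come out shorter than requested. */
static unsigned long ndelay_to_usecs(unsigned long nsecs)
{
	return (nsecs + 999) / 1000;
}
```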


include/asm-xtensa/delay.h
static __inline__ void udelay (unsigned long usecs)
{
	unsigned long start = xtensa_get_ccount();
	unsigned long cycles = usecs * (loops_per_jiffy / (1000000UL / HZ));

	/* Note: all variables are unsigned (can wrap around)! */
	while (((unsigned long)xtensa_get_ccount()) - start < cycles)
		;
}
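
The xtensa loop at least gets wraparound right: with unsigned arithmetic,
`now - start` is the elapsed cycle count modulo 2^32, so the comparison
stays correct even if the cycle counter overflows mid-delay. The same
pattern in isolation (illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative: why "(now - start) < cycles" is safe across counter
 * wraparound.  Unsigned subtraction yields the elapsed count modulo
 * 2^32, so a wrap between start and now does not break the test as
 * long as the delay is shorter than a full counter period. */
static int delay_done(uint32_t start, uint32_t now, uint32_t cycles)
{
	return (uint32_t)(now - start) >= cycles;
}
```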


include/asm-v850/delay.h
static inline void udelay(unsigned long usecs)
{
	register unsigned long full_loops, part_loops;

	full_loops = ((usecs * HZ) / 1000000) * loops_per_jiffy;
	usecs %= (1000000 / HZ);
	part_loops = (usecs * HZ * loops_per_jiffy) / 1000000;
	__delay(full_loops + part_loops);
}


include/asm-cris/delay.h
static inline void udelay(unsigned long usecs)
{
	__delay(usecs * loops_per_usec);
}


include/asm-blackfin/delay.h
static inline void udelay(unsigned long usecs)
{
	extern unsigned long loops_per_jiffy;
	__delay(usecs * loops_per_jiffy / (1000000 / HZ));
}
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] msleep() with hrtimers

2007-08-09 Thread Denis Vlasenko
On 8/8/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> You keep claiming that hrtimers are so incredibly expensive; but for
> msleep()... which is mostly called during driver init ... I really don't
> buy that it's really expensive. We're not doing this a gazilion times
> per second obviously...

Yes. Optimizing delay or sleep functions for speed is a contradiction
in terms. IIRC we still optimize udelay for speed, not code size...
Read it again folks:

We optimize udelay for speed

How fast do you want your udelay to be today?

Oh well.
--
vda


Re: [PATCH -mm] Introduce strtol_check_range()

2007-08-02 Thread Denis Vlasenko
On 8/2/07, Alexey Dobriyan <[EMAIL PROTECTED]> wrote:
> > > Please, copy strtonum() from BSD instead. Nobody needs another
> > > home-grown converter.
> >
> > BSD's strtonum(3) is a detestful, horrible shame.
> >
> > The strtol_check_range() I implemented here does _all_ that strtonum()
> > does, plus is generic w.r.t. base,
>
> What you did with base argument is creating opportunity to fsckup,
> namely, forgetting that base is last and putting it second.

Embedding the base in the function name (func10, func8, func16 [, func2])
will eliminate that possibility and also save one argument
push on the stack.

You can always multiplex them locally:

static int func_generic(base...) {...}

int func10(...) { return func_generic(10, ...); }
int func8(...) { return func_generic(8, ...); }

You can also have a faster "static int func_power_of_2(base...)" for
2, 8, 16, etc.
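
A sketch of that layout (the names are hypothetical, and plain strtol(3)
stands in for the kernel converter - this is the shape of the idea, not a
proposed patch):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical sketch of the multiplexing idea - function names are
 * made up and strtol(3) stands in for the kernel converter.  One
 * generic range-checked routine, wrapped per base so callers cannot
 * pass the base argument in the wrong position. */
static int str_to_long_generic(int base, const char *s,
			       long lo, long hi, long *res)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(s, &end, base);
	if (errno || end == s || *end != '\0' || v < lo || v > hi)
		return -1;
	*res = v;
	return 0;
}

static int str_to_long10(const char *s, long lo, long hi, long *res)
{
	return str_to_long_generic(10, s, lo, hi, res);
}

static int str_to_long16(const char *s, long lo, long hi, long *res)
{
	return str_to_long_generic(16, s, lo, hi, res);
}
```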
--
vda


[PATCH] debloat aic7xxx and aic79xx drivers by deinlining

2007-07-31 Thread Denis Vlasenko
Hi,

Attached patch deinlines and moves big functions from .h to .c files
in drivers/scsi/aic7xxx/*. I also had to add prototypes for ahc_lookup_scb
and ahd_lookup_scb to .h files.

No other code changes made.

Compile-tested on i386 and x86-64.
Total .text size reduction: ~60k on 64-bit, ~90k on 32-bit.

Per-object-file and whole-module size difference:
for x86-64:
   text    data  bss     dec    hex filename
 261433   50018 1172  312623  4c52f org/built-in.o
 199622   50018 1172  250812  3d3bc aic/built-in.o
  11680    7168    0   18848   49a0 org/aic7xxx_reg_print.o
  11680    7168    0   18848   49a0 aic/aic7xxx_reg_print.o
   3065       0    0    3065    bf9 org/aic7xxx_proc.o
   2849       0    0    2849    b21 aic/aic7xxx_proc.o
  16037    1984    0   18021   4665 org/aic7xxx_pci.o
  12896    1984    0   14880   3a20 aic/aic7xxx_pci.o
   1977    4768    0    6745   1a59 aic/aic7xxx_osm_pci.o
   1704    4768    0    6472   1948 org/aic7xxx_osm_pci.o
  15033     865  564   16462   404e org/aic7xxx_osm.o
  13752     865  564   15181   3b4d aic/aic7xxx_osm.o
  53228    7424    0   60652   ecec org/aic7xxx_core.o
  42925    7424    0   50349   c4ad aic/aic7xxx_core.o
   3193      72    0    3265    cc1 org/aic7xxx_93cx6.o
   1778      72    0    1850    73a aic/aic7xxx_93cx6.o
 103971   22321  564  126856  1ef88 org/aic7xxx.o
  87888   22321  564  110773  1b0b5 aic/aic7xxx.o
  25743   14016    0   39759   9b4f org/aic79xx_reg_print.o
  25743   14016    0   39759   9b4f aic/aic79xx_reg_print.o
   3312       0    0    3312    cf0 org/aic79xx_proc.o
   2764       0    0    2764    acc aic/aic79xx_proc.o
   9420     544   24    9988   2704 org/aic79xx_pci.o
   6539     544   24    7107   1bc3 aic/aic79xx_pci.o
   1805    6336    0    8141   1fcd org/aic79xx_osm_pci.o
   1791    6336    0    8127   1fbf aic/aic79xx_osm_pci.o
  18982    1189  564   20735   50ff org/aic79xx_osm.o
  17287    1189  564   19040   4a60 aic/aic79xx_osm.o
  98160    5600    0  103760  19550 org/aic79xx_core.o
  57572    5600    0   63172   f6c4 aic/aic79xx_core.o
 157435   27697  596  185728  2d580 org/aic79xx.o
 111708   27697  596  140001  222e1 aic/aic79xx.o

and for i386:
   text    data  bss     dec    hex filename
 280361   32633 1112  314106  4cafa org/built-in.o
 190406   32633 1112  224151  36b97 aic/built-in.o
  11697    3336    0   15033   3ab9 org/aic7xxx_reg_print.o
  11697    3336    0   15033   3ab9 aic/aic7xxx_reg_print.o
   2970       0    0    2970    b9a org/aic7xxx_proc.o
   2698       0    0    2698    a8a aic/aic7xxx_proc.o
  16700    1488    0   18188   470c org/aic7xxx_pci.o
  11984    1488    0   13472   34a0 aic/aic7xxx_pci.o
   1857    4044    0    5901   170d aic/aic7xxx_osm_pci.o
   1575    4044    0    5619   15f3 org/aic7xxx_osm_pci.o
  14876     561  548   15985   3e71 org/aic7xxx_osm.o
  12849     561  548   13958   3686 aic/aic7xxx_osm.o
  58959    5512    0   64471   fbd7 org/aic7xxx_core.o
  40907    5512    0   46419   b553 aic/aic7xxx_core.o
   3851      72    0    3923    f53 org/aic7xxx_93cx6.o
   1618      72    0    1690    69a aic/aic7xxx_93cx6.o
 110645   15013  548  126206  1ecfe org/aic7xxx.o
  83619   15013  548   99180  1836c aic/aic7xxx.o
  25762    6496    0   32258   7e02 org/aic79xx_reg_print.o
  25762    6496    0   32258   7e02 aic/aic79xx_reg_print.o
   3258       0    0    3258    cba org/aic79xx_proc.o
   2619       0    0    2619    a3b aic/aic79xx_proc.o
  10082     408   12   10502   2906 org/aic79xx_pci.o
   6145     408   12    6565   19a5 aic/aic79xx_pci.o
   1716    5416    0    7132   1bdc org/aic79xx_osm_pci.o
   1704    5416    0    7120   1bd0 aic/aic79xx_osm_pci.o
  18499     865  552   19916   4dcc org/aic79xx_osm.o
  16232     865  552   17649   44f1 aic/aic79xx_osm.o
 110391    4432    0  114823  1c087 org/aic79xx_core.o
  54314    4432    0   58746   e57a aic/aic79xx_core.o
 169715   17617  564  187896  2ddf8 org/aic79xx.o
 106784   17617  564  124965  1e825 aic/aic79xx.o

Please apply.

Signed-off-by: Denys Vlasenko <[EMAIL PROTECTED]>
--
vda


Re: Patches for REALLY TINY 386 kernels

2007-07-30 Thread Denis Vlasenko
On Wednesday 18 July 2007 22:04, Andi Kleen wrote:
> Better just write less bloated code. Perhaps mandatory bloatometer
> runs during -rc*s for kernels with minimal config with public code pig shame 
> lists
> similar to the regression lists are useful. Anyone volunteering?
>
> I suspect there is also much more low hanging fruit of this around.

Thousands of "static int flag" variables taking 4 bytes where 1 byte
(actually 1 bit) would suffice. And when you do "flag = 1",
store insns for bytes are also shorter by 3 bytes, _each_.

Unused code/data is linked in
(-ffunction-sections -fdata-sections -Wl,--gc-sections may help)

int global_n; char global_c; int global_m;
and you lose 3 bytes to alignment.
(How to instruct the linker to sort sections by alignment, or at least by size?
Tried -ffunction-sections -fdata-sections -Wl,--sort-section,alignment,
but it seems to only (try to) sort .data, not .data.var_name sections.)
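
The padding effect is easiest to see on a struct, where sizeof() makes it
measurable (assuming a 4-byte int, 2-byte short, and natural alignment, as
on i386/x86-64):

```c
#include <assert.h>

/* The same alignment-loss effect, measurable via sizeof() (assumes
 * 4-byte int, 2-byte short, natural alignment - true on i386/x86-64).
 * Interleaving sizes forces padding; sorting members largest-first
 * avoids it. */
struct scattered { char c; int n; short s; };	/* 1+3pad+4+2+2pad = 12 */
struct sorted    { int n; short s; char c; };	/* 4+2+1+1pad     = 8  */
```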

Massive inlining. Example: more than 80k of bloat in aic7*xx driver
because of gigantic inlined I/O access functions.

Sadistic alignment by gcc for structs/strings >= 32 bytes.
(gcc 4.2.1 is better, just don't forget -mpreferred-stack-boundary=2)
--
vda


Re: [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize

2007-07-30 Thread Denis Vlasenko
Hi Satyam,

On Monday 23 July 2007 17:05, Satyam Sharma wrote:
> There was a lot of bogus stuff that include/asm-i386/bitops.h was doing,
> that was unnecessary and not required for the correctness of those APIs.
> All that superfluous stuff was also unnecessarily disallowing compiler
> optimization possibilities, and making gcc generate code that wasn't as
> beautiful as it could otherwise have been. Hence the following series
> of cleanups (some trivial, some surprising, in no particular order):

[I did read entire thread]

Welcome to the minefield.

This bitops and barrier stuff is complicated. It's very easy to
introduce bugs which are hard to trigger, or happen only with some
specific gcc versions, or only on massively parallel SMP boxes.

You can also make technically correct changes which relax needlessly
strict barrier semantics of some bitops and trigger latent bugs
in code which was unknowingly depending on it.

How you can proceed:

Make a change which you believe is right. Recompile an allyesconfig
kernel with and without this change. Find a piece of assembly code
which became different. Check that the new code is correct (and smaller
and/or faster). Post your patch together with example(s) of code
fragments that got better. Be as verbose as needed.

Repeat for each change separately.

This can be painfully slow, but it is less likely to be rejected outright
for fear of introducing difficult bugs.

> * Marking "memory" as clobbered for no good reason

I vaguely remember that "memory" clobbers are needed in some rather
obscure, non-obvious situations. Google for it - Linus wrote about it
on lkml (a few years ago, IIRC).
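
For reference, the "memory" clobber is what turns an empty asm into a
compiler barrier: without it, gcc is free to keep a value cached in a
register across the asm. A standalone illustration (this is the idiom
behind the kernel's barrier() macro, not kernel code itself):

```c
#include <assert.h>

/* Illustration: an empty asm with a "memory" clobber is a pure
 * compiler barrier.  Without the clobber gcc may reuse the value of
 * *p it already holds in a register; with it, gcc must assume any
 * memory changed and reload *p after the asm. */
#define barrier() __asm__ __volatile__("" ::: "memory")

static int read_around_barrier(int *p)
{
	int before = *p;

	barrier();		/* *p may have changed behind gcc's back */
	return *p - before;	/* forces a reload of *p after the barrier */
}
```

Single-threaded the result is of course 0; the point is in the generated
code, where the second read of *p must come from memory.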

> * Volatile-casting of memory addresses
>   (wholly unnecessary, makes gcc generate bad code)

Do you know any code difference resulting from this patch?

> * Unwarranted use of __asm__ __volatile__ even when those semantics
>   are not required

ditto

> * Unnecessarily harsh definitions of smp_mb__{before, after}_clear_bit()
>   (again, this was like *asking* gcc to generate bad code)

ditto

> My testbox boots/works fine with all these patches (uptime half an hour)

For this kind of thing, you really need something more stressful.
Try to find big-SMP people who are willing to give it a whirl.

> and the compressed bzImage is smaller by about ~2 KB for my .config --

At least it proves that _something_ changed.
--
vda


Re: Problematic __attribute__((section(" "))) and gcc alignment

2007-07-22 Thread Denis Vlasenko
On Thursday 21 June 2007 21:32, Mathieu Desnoyers wrote:
> Let's take arch/i386/boot/video.h as an example:
> 
> it defines 
> 
> struct card_info {
> const char *card_name;
> int (*set_mode)(struct mode_info *mode);
> int (*probe)(void);
> struct mode_info *modes;
> int nmodes; /* Number of probed modes so far */
> int unsafe; /* Probing is unsafe, only do after "scan" */
> u16 xmode_first;/* Unprobed modes to try to call anyway */
> u16 xmode_n;/* Size of unprobed mode range */
> };
> 
> Which is 28 bytes in size (so it is ok for now). If one single field is
> added, gcc will start aligning this structure on 32 bytes boundaries.
> (see http://gcc.gnu.org/ml/gcc-bugs/1999-11/msg00914.html)
> 
> We then have
> #define __videocard struct card_info __attribute__((section(".videocards")))
> extern struct card_info video_cards[], video_cards_end[];
> 
> Which instructs gcc to put these structures in the .videocards section.
> The linker scripts arch/i386/boot/setup.ld will assign video_cards and
> video_cards_end as pointers to the beginning and the end of this
> section. video_cards[0] is therefore expected to give the first
> structure in the section.
> 
> The problem with this is that gcc will align it on 32 bytes boundaries
> relative to what it "thinks" is the start of the section, which has
> nothing to do with the actual section layout given by the linker script.

The problem is that gcc is too eager to align stuff to some big power of two
upon reaching some irrelevant threshold. Why structures 32 bytes and more
in size should be aligned to 32 bytes (even if they have no doubles
and thus are not planned to be used by SSE code) is beyond me.
Why string literals of 32+ bytes are aligned is (beyond me)^2.
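
When a struct must be walked as an array across a linker section, the usual
defence is to pin its alignment explicitly, so that sizeof() and the
in-section stride agree regardless of gcc's heuristics. A generic sketch
(not the actual video.h fix):

```c
#include <assert.h>

/* Generic sketch, not the actual video.h fix: specifying the type's
 * alignment explicitly makes the array stride predictable, so base[i]
 * arithmetic matches the layout the linker script produces for the
 * section. */
struct section_entry {
	const char *name;
	int value;
} __attribute__((aligned(4)));

static const struct section_entry *nth(const struct section_entry *base, int i)
{
	return base + i;	/* stride is exactly sizeof(struct section_entry) */
}
```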

These are reverted in latest gcc (for -Os only):

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31319

but meanwhile gcc started to align stack to 16 bytes, *unconditionally*:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32849

I imagine 4K stack people will especially like it.

Apart from being bloaty, this also broke de-facto i386 ABI.
There is a solution which isn't bloaty and doesn't break the ABI.
But it wasn't chosen. :(
--
vda


Re: Problematic __attribute__((section( ))) and gcc alignment

2007-07-22 Thread Denis Vlasenko
On Thursday 21 June 2007 21:32, Mathieu Desnoyers wrote:
> Let's take arch/i386/boot/video.h as an example:
> 
> it defines
> 
> struct card_info {
>         const char *card_name;
>         int (*set_mode)(struct mode_info *mode);
>         int (*probe)(void);
>         struct mode_info *modes;
>         int nmodes;             /* Number of probed modes so far */
>         int unsafe;             /* Probing is unsafe, only do after scan */
>         u16 xmode_first;        /* Unprobed modes to try to call anyway */
>         u16 xmode_n;            /* Size of unprobed mode range */
> };
> 
> Which is 28 bytes in size (so it is ok for now). If one single field is
> added, gcc will start aligning this structure on 32 byte boundaries.
> (see http://gcc.gnu.org/ml/gcc-bugs/1999-11/msg00914.html)
> 
> We then have
> #define __videocard struct card_info __attribute__((section(".videocards")))
> extern struct card_info video_cards[], video_cards_end[];
> 
> Which instructs gcc to put these structures in the .videocards section.
> The linker script arch/i386/boot/setup.ld will assign video_cards and
> video_cards_end as pointers to the beginning and the end of this
> section. video_cards[0] is therefore expected to give the first
> structure in the section.
> 
> The problem with this is that gcc will align it on 32 byte boundaries
> relative to what it "thinks" is the start of the section, which has
> nothing to do with the actual section layout given by the linker script.

The problem is that gcc is too eager to align stuff to some big power of two
upon reaching some irrelevant threshold. Why structures 32 bytes and more
in size should be aligned to 32 bytes (even if they have no doubles
and thus are not planned to be used by SSE code) is beyond me.
Why string literals of 32+ bytes are aligned is (beyond me)^2.

These are reverted in latest gcc (for -Os only):

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31319

but meanwhile gcc started to align stack to 16 bytes, *unconditionally*:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32849

I imagine 4K stack people will especially like it.

Apart from being bloaty, this also broke the de-facto i386 ABI.
There is a solution which isn't bloaty and doesn't break the ABI.
But it wasn't chosen. :(
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [9/58] x86_64: Always use builtin memcpy on gcc 4.3

2007-07-21 Thread Denis Vlasenko
On Sunday 22 July 2007 00:16, Oleg Verych wrote:
> * From: Andi Kleen <[EMAIL PROTECTED]>
> * Date: Thu, 19 Jul 2007 11:54:53 +0200 (CEST)
> >
> > Jan asked to always use the builtin memcpy on gcc 4.3 mainline because
> > it should generate better code than the old macro. Let's try it.
> 
> Unfortunately such info is hard to find. The [EMAIL PROTECTED] list is
> empty. So, let me ask how this memcpy relates to recently submitted
> for glibc one [0]?
> 
> [0] http://permalink.gmane.org/gmane.comp.lib.glibc.alpha/12217

Am I stupid, or do the files attached to that post demonstrate that the
"new" code isn't much better and is sometimes worse (aligned 4096 byte
memcpy went from 558 to 648 for Core 2)?

Beware that text files in test-memcpy.tar.bz2 seem to have
simple_memcpy / builtin_memcpy / memcpy columns swapped
(-old and -new files have them in different order).
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?

2007-07-19 Thread Denis Vlasenko
On Tuesday 17 July 2007 00:42, Bodo Eggert wrote:
> > Please note that I was not trying to remove the 8K stack option right
> > now - heck, I didn't even add anything to feature-removal-schedule.txt
> > - all I wanted to accomplish with the patch that started this thread
> > was:  a) indicate that the 4K option is no longer a debug thing  and
> 
> Very ACK.
> 
> > b) make 4K stacks the default option in vanilla kernel.org kernels as
> > a gentle nudge towards getting people to start fixing the code paths
> > that are not 4K stack safe.
> 
> That's the big NACK. It's OK for MM, where things are supposed to be in a 
> not well-tested state, but for running possibly mission-critical systems,
> you should take no risk.

Mission-critical machines are not supposed to have kernels configured
by incompetent/careless sysadmins who didn't think about the
config choices they made at kernel build time.
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] vsprintf.c: optimizing, part 2: base 10 conversion speedup, v2

2007-07-11 Thread Denis Vlasenko
On Thursday 05 July 2007 21:34, Andrew Morton wrote:
> On Thu, 5 Jul 2007 12:51:52 +0200
> Denis Vlasenko <[EMAIL PROTECTED]> wrote:
> 
> > Using code from
> > 
> > http://www.cs.uiowa.edu/~jones/bcd/decimal.html
> > (with permission from the author, Douglas W. Jones)
> 
> Neither of your patches had signed-off-by:s.  Would prefer that they were
> included please, given that we're adding stuff from someone's website.

Sorry. Consider this added to both patches:

Signed-off-by: Denys Vlasenko <[EMAIL PROTECTED]>

Yes. "Denys" is how Ukrainian bureaucracy insists on spelling my name.
This Signed-off-by thing is official, so here it goes.
Informally please use "Denis".
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kill -9?

2007-07-06 Thread Denis Vlasenko
On Friday 06 July 2007 08:35, Jesper Juhl wrote:
> On 06/07/07, Kaleem Khan <[EMAIL PROTECTED]> wrote:
> > Hello Kernel experts,
> >
> > I'd like to know whether there's a way to take some action (say
> > calling a routine) in
> > response to 'kill -9' before the process is terminated. I tend to
> > think it's against 'kill -9'
> > UNIX/Linux philosophy but still I'd like to confirm.
> >
> You can't catch/block SIGKILL (9), but you can catch SIGTERM (15 -
> what kill sends by default).
> 
> A well behaved app should catch SIGTERM and do proper cleanup before
> shutdown so that when a user does  kill pid_of_app  it shuts down
> cleanly.  kill -9 pid_of_app  shouldn't normally be needed - it is
> for emergency termination of the app, which is why you can't catch it.

Tell that to Oracle. They believe that they are above any rules
and conventions. TERM does not terminate oracle db.

I tried to explain to the Oracle DBAs I met how terribly wrong it is.
Quite a frustrating experience.
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Some love to default profiler

2007-07-05 Thread Denis Vlasenko
On Thursday 05 July 2007 01:50, Jesper Juhl wrote:
> > Removes conditional branch from schedule(). Code savings on my
> > usual config:
> >
> >    text    data     bss     dec     hex filename
> > 2921871  179895  180224 3281990  321446 vmlinux before
> > 2920141  179847  180224 3280212  320d54 vmlinux after
> > -----------------------------------------------------
> >   -1730     -48       0   -1778
>
> Nice savings there. Not that 1.7K is huge, but kernel memory is
> precious :-)

Hehe. In the busybox project, people would kill for 1.7K :)
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kconfig .po files in kernel tree? [Was: Documentation/HOWTO translated into Japanese]

2007-06-11 Thread Denis Vlasenko
On Monday 11 June 2007 02:56, Paul Mundt wrote:
> On Mon, Jun 11, 2007 at 01:59:00AM +0200, Denis Vlasenko wrote:
> > On Sunday 10 June 2007 20:58, Rene Herman wrote:
> > > All that stuff only serves to multiply the speed at which a fixed
> > > percentage of content obsoletes itself. When it's still new and
> > > shiny, sure, stuff will get translated but in no time at all it'll
> > > become a fragmented mess which nobody ever feels right about removing
> > > because that would be anti-social to all those poor non-english
> > > speaking kernel hackers out there.
> > 
> > I agree. i18n efforts won't help one iota because people just have
> > to know English in order to participate in l-k development.
> 
> That's a ridiculous statement. Non-native language abilities and
> technical competence have very little to do with each other. People have
> to understand the code and figure out what it is that they want to
> change. As long as this is done cleanly and the intent is obvious,
> language doesn't even factor in beyond the Signed-off-by tag. Explanation
> is necessary from time to time, but it really depends on the area in
> which someone is working. If it's a complicated and involved change, then
> of course it takes a bit more effort on both sides, but that doesn't
> invalidate the importance or necessity of the work.

Point me to one person who doesn't know English at all
and who has successfully participated in l-k devel.

I'm not saying that non-English should be banned or something.
In Kconfig it can even make sense. A section on kernel.org
where people can put translations is also a good idea.
I still think that it is an almost useless activity,
but who knows, maybe I'm wrong.

Just not the Documentation/<lang>/* thing, and no i18n of printks.
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kconfig .po files in kernel tree? [Was: Documentation/HOWTO translated into Japanese]

2007-06-10 Thread Denis Vlasenko
On Sunday 10 June 2007 20:58, Rene Herman wrote:
> All that stuff only serves to multiply the speed at which a fixed percentage 
> of content obsoletes itself. When it's still new and shiny, sure, stuff will 
> get translated but in no time at all it'll become a fragmented mess which 
> nobody ever feels right about removing because that would be anti-social to 
> all those poor non-english speaking kernel hackers out there.

I agree. i18n efforts won't help one iota because people just have
to know English in order to participate in l-k development.
They should be able to read _and_ _reply_ to lkml posts,
and read and understand code _and_ _comments_.

Those who cannot participate in development because they don't
know English, won't get much help from some bits of semi-obsolete
Documentation/* being available. Ok, they will read it, then what?
How are they supposed to read the code? Write email? etc...

There is only one practical solution: learn the language.

It's not about *English* per se. It just happened so historically
that CS has originated in English speaking countries.

BTW, I learned it by reading sci-fi (Asimov's Foundation was the first thing),
and then lkml. :)
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG in 2.6.22-rc2-mm1: NIC module b44.c broken (Broadcom 4400)

2007-05-27 Thread Denis Vlasenko
On Thursday 24 May 2007 21:56, Uwe Bugla wrote:
> Please note:
> 
> 1. IRQ 255 looks very idiotic, doesn't it? It does not exist at all, does it?
> 
> Questions:
> 
> 1. What is the technical need / progress of module ssb please?
> 
> 2. If Andrew Morton's guidelines clearly say: "Do test your patches on three 
> different machines" and this guideline seems to be strictly ignored by some 
> sparetime hackers:
> 
> What is the master plan then to avoid the fact that such a crap is being sent 
> in to Andrew?
> 
> Yours sincerely
> 
> Uwe
> 
> P. S.: There is an important saying going like this:
> 
> Too many cooks do mess up the pap.
> 
> Regarding the patch in mm-tree I can see SIX (!) Copyright owners.
> The last one of them (i. e. the one of 2007) obviously does not seem to
> understand what he is doing (see that nonsense interrupt please, just
> incredible!) :(
> 
> In so far I would deeply appreciate Andrew Morton to throw that b44.c patch 
> into the trashbox as soon as possible :)

Uwe, you are an arrogant idiot and I think it's best
for everybody to just ignore all your mails, regardless
of their technical merits.

Even if your mail reports a real bug, the shitload of insults
to developers it adds far outweighs any possible useful info.

Developers can save a lot of time and nerves by just waiting for
someone else to hit the same bug, if it exists, and then debug it
as usual.
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Question about Reiser4

2007-04-24 Thread Denis Vlasenko

Theodore Tso wrote:
>
> One of the big problems of using a filesystem as a DB is the system
> call overheads.  If you use huge numbers of tiny files, then each
> attempt read an atom of information from the DB takes three system
> calls --- an open(), read(), and close(), with all of the overheads in
> terms of dentry and inode cache.
>

Now, to be fair, there are probably a number of cases where
open/lseek/readv/close and open/lseek/writev/close would be worth doing
as a single system call.  The big problem as far as I can see involves
EINTR handling; such a system call has serious restartability implications.

Of course, there are Ingo's syslets...


I would definitely like an open/readv/close syscall a lot.
Actually, a set of four syscalls

open/readv/close
open/pread/close
open/writev/close
open/pwrite/close

would allow reducing syscall overhead in a number of cases.
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: unable to run busybox /sbin/int

2007-04-21 Thread Denis Vlasenko
Hi Tom.

On Thursday 19 April 2007 21:00, Tom Strader wrote:
> This is the final output from my kernel as I try to launch busybox
> (/sbin/init is linked to /bin/busybox)
> As it launches the kernel looks for libraries which do not exist (not
> sure why), but it appears to find /lib/libcrypt.so.1 and /lib/libc.so.6
> but the system does not output after that.  I can press keys on the
> keyboard and they are echoed to the screen, I can also use the control
> characters C-c, C-s, C-q, and so on and I see kernel messages indicating
> the uart_flush_buffer(0) is being called but busybox does not appear to
> start.  Here is my kernel output, any suggestions would help. Thanks.

Ok, here we go again.

Does a "hello, world" program work as init,
and do you see its output? (init=/path/to/hello_world)

If no: what is your console, serial I think? How do you specify
it on kernel command line?

If yes: does init=/bin/sh work?
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Denis Vlasenko
On Saturday 21 April 2007 18:00, Ingo Molnar wrote:
> correct. Note that Willy reniced X back to 0 so it had no relevance on 
> his test. Also note that i pointed this change out in the -v4 CFS 
> announcement:
> 
> || Changes since -v3:
> ||
> ||  - usability fix: automatic renicing of kernel threads such as 
> ||keventd, OOM tasks and tasks doing privileged hardware access
> ||(such as Xorg).
> 
> i've attached it below in a standalone form, feel free to put it into 
> SD! :)

But X problems have nothing to do with "privileged hardware access".
X problems are related to priority inversions between server and client
processes, and the "one server process - many client processes" case.

I think the synchronous nature of Xlib (clients cannot fire-and-forget
their commands to the X server; with Xlib each command waits for an ACK
from the server) also adds some amount of pain.
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Upgraded to 2.6.20.7 - positives

2007-04-18 Thread Denis Vlasenko
Hi kernel people,

Just upgraded my home box to 2.6.20.7. Wow.

* Reiser3 mount times are drastically reduced,
  even when journal replay is needed
  (I have a few 100GB+ reiser3 partitions mounted at boot)
* The sit pseudo-interface is gone. In the previous kernel, I tried
  to disable it in the kernel config to no avail. Now it was easy
  to simply compile it as a module.
* From make menuconfig questions it looks like SATA/PATA
  rewrite (in the form of libata) is almost finished. Hehe,
  untangling IDE mess was quite a feat, and Jeff did it. Kudos.

Need to check now whether losetup oopses are gone too,
or hunt them down if they are still with us :)

Thanks everybody for your amazing work.
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/5] signalfd v2 - signalfd core ...

2007-03-30 Thread Denis Vlasenko
On Thursday 08 March 2007 18:28, Linus Torvalds wrote:
> The sad part is that there really is no reason why the BSD crowd couldn't 
> have done recvmsg() as an "extended read with per-system call flags", 
> which would have made things like O_NONBLOCK etc unnecessary, because you 
> could do it just with MSG_DONTWAIT..

Wait a second here... O_NONBLOCK is not just unnecessary - it's buggy!

Try to do nonblocking read from stdin (fd #0) -
* setting O_NONBLOCK with fcntl will set it for all other processes
  which share the same stdin!
* trying to reset O_NONBLOCK after the read doesn't help (think kill -9)
* duping fd #0 doesn't help because O_NONBLOCK is not per-fd,
  it's shared just like filepos.

I really like that trick with recvmsg + MSG_DONTWAIT instead.
--
vda


Re: whence CONFIG_PROVE_SPIN_LOCKING?

2007-03-18 Thread Denis Vlasenko
Hi,

On Sunday 18 March 2007 22:06, Robert P. J. Day wrote:
> p.s.  just FYI, i ran my "find dead CONFIG variables" script on the
> entire tree and, as we speak, there are 316 preprocessor tests that
> are testing variables of the form "CONFIG_whatever" for which that
> option is not set anywhere in the tree.  (that is, 316 distinct
> variables, not just 316 distinct tests.)  see the attached script and
> feel free to run it from the top of the tree on your favourite
> directory or sub-directory.

In the busybox project we adopted -Wundef, and we try to minimize
the use of #ifdef CONFIG_xxx: each boolean CONFIG_xxx option in
busybox is accompanied by an ENABLE_xxx #define which is 1 or 0,
never "undefined", and we check that instead of CONFIG_xxx.

Because of -Wundef, gcc complains whenever we use #if
on an undefined ENABLE_xxx.
--
vda


Re: Problem: cat < /dev/my_ttyS0 is not blocked

2007-03-10 Thread Denis Vlasenko
On Saturday 10 March 2007 13:16, Mockern wrote:
> I have a problem with  cat < /dev/my_ttyS0 (see strace output below).
> cat function is not blocked. I don't understand why it is not stopped
> at read(0, __  and terminated?  
> Thank you

Because /dev/my_ttyS0 is probably a null file.

Please show output of 'ls -l /dev/*ttyS*'

--
vda


Re: O_NONBLOCK setting "leak" outside of a process??

2007-02-03 Thread Denis Vlasenko
On Sunday 04 February 2007 01:55, David Schwartz wrote:
> 
> > That's a bug, right? I couldn't find anything to that effect in IEEE
> > Std. 1003.1, 2004 Edition...
> >
> > Ciao,
> >  Roland
> 
> It's not a bug, there's no rational alternative. What would two independent
> file descriptors for the same end of a TCP connection be?

Easy. O_NONBLOCK should only affect whether read/write blocks or
returns EAGAIN. It's logical for this setting to be per-process.

Currently changing O_NONBLOCK on stdin/out/err affects other,
possibly unrelated processes - they don't expect that *their*
reads/writes will start returning EAGAIN!

Worse, it cannot be worked around by dup() because duped fds
are still sharing O_NONBLOCK. How can I work around this?
--
vda


Re: O_NONBLOCK setting "leak" outside of a process??

2007-02-01 Thread Denis Vlasenko
On Tuesday 30 January 2007 04:40, Philippe Troin wrote:
> > int main() {
> > fcntl(0, F_SETFL, fcntl(0, F_GETFL, 0) | O_NONBLOCK);
> > return 0;
> > }
> > 
> > int main() {
> > fcntl(0, F_SETFL, fcntl(0, F_GETFL, 0) & ~O_NONBLOCK);
> > return 0;
> > }
> > 
> > If I run "nonblock" in Midnight Commander in KDE's Konsole,
> > screen redraw starts to work ~5 times slower. For example,
> > Ctrl-O ("show/hide panels" in MC) takes ~0.5 sec to redraw.
> > This persists after the program exits (which it
> > does immediately as you see).
> > Running "block" reverts things to normal.
> > 
> > I mean: how can O_NONBLOCK _issued in a process which
> > already exited_ have any effect whatsoever on MC or Konsole?
> > They can't even know that it did it, right?
> > 
> > Either I do not know something subtle about Unix or some sort
> > of bug is at work.
> 
> Because they all share the same stdin file descriptor, therefore they
> share the same file descriptor flags?

Who shares the same file descriptor? MC and the programs started from it?

I thought that after exec() fds are either closed (if CLOEXEC) or
become independent from the parent process
(i.e. if you seek, close, etc. your fd, the parent would not notice).

Am I wrong?
--
vda


Re: O_DIRECT question

2007-01-29 Thread Denis Vlasenko
On Monday 29 January 2007 18:00, Andrea Arcangeli wrote:
> On Sun, Jan 28, 2007 at 06:03:08PM +0100, Denis Vlasenko wrote:
> > I still don't see much difference between O_SYNC and O_DIRECT write
> > semantic.
> 
> O_DIRECT is about avoiding the copy_user between cache and userland,
> when working with devices that runs faster than ram (think >=100M/sec,
> quite standard hardware unless you've only a desktop or you cannot
> afford raid).

Yes, I know that, but O_DIRECT is also "overloaded" with
O_SYNC-like semantics ("write doesn't return until data hits
physical media"). Having two orthogonal things "mixed together"
in one flag feels "not Unixy" to me. So I am trying to formulate
saner semantics. So far I think that this looks good:

O_SYNC - usual meaning
O_STREAM - do not try hard to cache me. This includes "if you can
(buffer is sufficiently aligned, yadda, yadda), do not
copy_user into pagecache but just DMA from userspace
pages" - exactly because user told us that he is not
interested in caching!

Then O_DIRECT is approximately = O_SYNC + O_STREAM, and I think
maybe Linus will not hate this "new" O_DIRECT - it doesn't
bypass pagecache.

> O_SYNC is about working around buggy or underperforming VM growing the
> dirty levels beyond optimal levels, or to open logfiles that you want
> to save to disk ASAP (most other journaling usages are better done
> with fsync instead).

I've got a feeling that db people use O_DIRECT (its O_SYNCy behaviour)
as a poor man's write barrier when they must be sure that their redo
logs have hit storage before they start to modify datafiles.
Another reason why they want sync writes is write error detection.
They cannot afford delaying it.
--
vda


Re: O_DIRECT question

2007-01-28 Thread Denis Vlasenko
On Sunday 28 January 2007 16:30, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > On Saturday 27 January 2007 15:01, Bodo Eggert wrote:
> >> Denis Vlasenko <[EMAIL PROTECTED]> wrote:
> >>> On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> >>>> Denis Vlasenko wrote:
> >>>>> On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> >>>>>> But even single-threaded I/O but in large quantities benefits from
> >>>>>> O_DIRECT significantly, and I pointed this out before.
> >>>>> Which shouldn't be true. There is no fundamental reason why
> >>>>> ordinary writes should be slower than O_DIRECT.
> >>>>>
> >>>> Other than the copy to buffer taking CPU and memory resources.
> >>> It is not required by any standard that I know. Kernel can be smarter
> >>> and avoid that if it can.
> >> The kernel can also solve the halting problem if it can.
> >>
> >> Do you really think an entropy estimation code on all access patterns in 
> >> the
> >> system will be free as in beer,
> > 
> > Actually I think we need this heuristic:
> > 
> > if (opened_with_O_STREAM && buffer_is_aligned
> > && io_size_is_a_multiple_of_sectorsize)
> > do_IO_directly_to_user_buffer_without_memcpy
> > 
> > is not *that* complicated.
> > 
> > I think that we can get rid of O_DIRECT peculiar requirements
> > "you *must* not cache me" + "you *must* write me directly to bare metal"
> > by replacing it with O_STREAM ("*advice* to not cache me") + O_SYNC
> > ("write() should return only when data is written to storage, not sooner").
> > 
> > Why?
> > 
> > Because these O_DIRECT "musts" are rather unusual and overkill. Apps
> > should not have that much control over what kernel does internally;
> > and also O_DIRECT was mixing shampoo and conditioner on one bottle
> > (no-cache and sync writes) - bad API.
> 
> What a shame that other operating systems can manage to really support 
> O_DIRECT, and that major application software can use this api to write 
> portable code that works even on Windows.
> 
> You overlooked the problem that applications using this api assume that 
> reads are on bare metal as well, how do you address the case where 
> thread A does a write, thread B does a read? If you give thread B data 
> from a buffer and it then does a write to another file (which completes 
> before the write from thread A), and then the system crashes, you have 
> just put the files out of sync.

So applications synchronize their data integrity by keeping
data on the hard drive, relying on "a read goes to bare metal,
so it can't see written data before it gets written to bare
metal". Wow, this is slow.
Are you talking about this scenario:

Bad:
fd = open(..., O_SYNC);
fork()
write(fd, buf); [1]
   read(fd, buf2); [starts after write 1 started]
   write(somewhere_else, buf2);
   (write returns)
 < crash point
(write returns)

This will be *very* slow - if you use O_DIRECT and do what
is depicted above, you write data, then you read it back,
which is slow. Why do you want that? Isn't it
much faster to just wait for the write to complete, and allow
the read to fetch (potentially) cached data?

Better:
fd = open(..., O_SYNC);
fork()
write(fd, buf); [1]
   (wait for write to finish)


 < crash point
(write returns)
   read(fd, buf2); [starts after write 1 started]
   write(somewhere_else, buf2);
   (write returns)

> So you may have to block all i/o for all  
> threads of the application to be sure that doesn't happen.

Not all, only related i/o.
--
vda


Re: O_DIRECT question

2007-01-28 Thread Denis Vlasenko
On Sunday 28 January 2007 16:18, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> >> Denis Vlasenko wrote:
> >>> On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> >>>> Phillip Susi wrote:
> 
>   [...]
> 
> >>>> But even single-threaded I/O but in large quantities benefits from 
> >>>> O_DIRECT
> >>>> significantly, and I pointed this out before.
> >>> Which shouldn't be true. There is no fundamental reason why
> >>> ordinary writes should be slower than O_DIRECT.
> >>>
> >> Other than the copy to buffer taking CPU and memory resources.
> > 
> > It is not required by any standard that I know. Kernel can be smarter
> > and avoid that if it can.
> 
> Actually, no, the whole idea of page cache is that overall system i/o 
> can be faster if data sit in the page cache for a while. But the real 
> problem is that the application write is now disconnected from the 
> physical write, both in time and order.

Not in O_SYNC case.

> No standard says the kernel couldn't do direct DMA, but since having 
> that required is needed to guarantee write order and error status linked 
> to the actual application i/o, what a kernel "might do" is irrelevant.
> 
> It's much easier to do O_DIRECT by actually doing the direct i/o than to 
> try to catch all the corner cases which arise in faking it.

I still don't see much difference between O_SYNC and O_DIRECT write
semantic.
--
vda


O_NONBLOCK setting "leak" outside of a process??

2007-01-27 Thread Denis Vlasenko
Hi,

I am currently on Linux 2.6.18, x86_64.
I came across strange behavior while working on one
of the busybox applets. I narrowed it down to these two
trivial testcases:

#include <unistd.h>
#include <fcntl.h>
int main() {
fcntl(0, F_SETFL, fcntl(0, F_GETFL, 0) | O_NONBLOCK);
return 0;
}

#include <unistd.h>
#include <fcntl.h>
int main() {
fcntl(0, F_SETFL, fcntl(0, F_GETFL, 0) & ~O_NONBLOCK);
return 0;
}

If I run "nonblock" in Midnight Commander in KDE's Konsole,
screen redraw starts to work ~5 times slower. For example,
Ctrl-O ("show/hide panels" in MC) takes ~0.5 sec to redraw.
This persists after the program exits (which it
does immediately, as you see).
Running "block" reverts things to normal.

I mean: how can O_NONBLOCK _issued in a process which
already exited_ have any effect whatsoever on MC or Konsole?
They can't even know that it did it, right?

Either I do not know something subtle about Unix or some sort
of bug is at work.

Any advice?
--
vda


Re: O_DIRECT question

2007-01-27 Thread Denis Vlasenko
On Saturday 27 January 2007 15:01, Bodo Eggert wrote:
> Denis Vlasenko <[EMAIL PROTECTED]> wrote:
> > On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> >> Denis Vlasenko wrote:
> >> > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> 
> >> >> But even single-threaded I/O but in large quantities benefits from
> >> >> O_DIRECT significantly, and I pointed this out before.
> >> > 
> >> > Which shouldn't be true. There is no fundamental reason why
> >> > ordinary writes should be slower than O_DIRECT.
> >> > 
> >> Other than the copy to buffer taking CPU and memory resources.
> > 
> > It is not required by any standard that I know. Kernel can be smarter
> > and avoid that if it can.
> 
> The kernel can also solve the halting problem if it can.
> 
> Do you really think an entropy estimation code on all access patterns in the
> system will be free as in beer,

Actually I think we need this heuristic:

if (opened_with_O_STREAM && buffer_is_aligned
&& io_size_is_a_multiple_of_sectorsize)
do_IO_directly_to_user_buffer_without_memcpy

is not *that* complicated.

I think that we can get rid of O_DIRECT peculiar requirements
"you *must* not cache me" + "you *must* write me directly to bare metal"
by replacing it with O_STREAM ("*advice* to not cache me") + O_SYNC
("write() should return only when data is written to storage, not sooner").

Why?

Because these O_DIRECT "musts" are rather unusual and overkill. Apps
should not have that much control over what the kernel does internally;
and also O_DIRECT was mixing shampoo and conditioner in one bottle
(no-cache and sync writes) - bad API.
--
vda


Re: O_DIRECT question

2007-01-27 Thread Denis Vlasenko
On Saturday 27 January 2007 15:01, Bodo Eggert wrote:
 Denis Vlasenko [EMAIL PROTECTED] wrote:
  On Friday 26 January 2007 19:23, Bill Davidsen wrote:
  Denis Vlasenko wrote:
   On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
 
   But even single-threaded I/O but in large quantities benefits from
   O_DIRECT significantly, and I pointed this out before.
   
   Which shouldn't be true. There is no fundamental reason why
   ordinary writes should be slower than O_DIRECT.
   
  Other than the copy to buffer taking CPU and memory resources.
  
  It is not required by any standard that I know. Kernel can be smarter
  and avoid that if it can.
 
 The kernel can also solve the halting problem if it can.
 
 Do you really think an entropy estamination code on all access patterns in the
 system will be free as in beer,

Actually I think we need this heuristic:

if (opened_with_O_STREAM  buffer_is_aligned
 io_size_is_a_multiple_of_sectorsize)
do_IO_directly_to_user_buffer_without_memcpy

is not *that* compilcated.

I think that we can get rid of O_DIRECT peculiar requirements
you *must* not cache me + you *must* write me directly to bare metal
by replacing it with O_STREAM (*advice* to not cache me) + O_SYNC
(write() should return only when data is written to storage, not sooner).

Why?

Because these O_DIRECT musts are rather unusual and overkill. Apps
should not have that much control over what kernel does internally;
and also O_DIRECT was mixing shampoo and conditioner on one bottle
(no-cache and sync writes) - bad API.
--
vda
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


O_NONBLOCK setting leak outside of a process??

2007-01-27 Thread Denis Vlasenko
Hi,

I am currently on Linux 2.6.18, x86_64.
I came across strange behavior while working on one
of busybox applets. I narrowed it down to these two
trivial testcases:

#include unistd.h
#include fcntl.h
int main() {
fcntl(0, F_SETFL, fcntl(0, F_GETFL, 0) | O_NONBLOCK);
return 0;
}

#include unistd.h
#include fcntl.h
int main() {
fcntl(0, F_SETFL, fcntl(0, F_GETFL, 0)  ~O_NONBLOCK);
return 0;
}

If I run the nonblock testcase in Midnight Commander in KDE's Konsole,
screen redraw starts to work ~5 times slower. For example,
Ctrl-O (show/hide panels in MC) takes ~0.5 sec to redraw.
This persists after the program exits (which it
does immediately, as you see).
Running the block testcase reverts things to normal.

I mean: how can O_NONBLOCK _issued in a process which
already exited_ have any effect whatsoever on MC or Konsole?
They can't even know that it did it, right?

Either I do not know something subtle about Unix or some sort
of bug is at work.

Any advice?
--
vda


Re: O_DIRECT question

2007-01-26 Thread Denis Vlasenko
On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> >> Phillip Susi wrote:
> >>> Denis Vlasenko wrote:
> >>>> You mean "You can use aio_write" ?
> >>> Exactly.  You generally don't use O_DIRECT without aio.  Combining the
> >>> two is what gives the big win.
> >> Well, it's not only aio.  Multithreaded I/O also helps alot -- all this,
> >> say, to utilize a raid array with many spindles.
> >>
> >> But even single-threaded I/O but in large quantities benefits from O_DIRECT
> >> significantly, and I pointed this out before.
> > 
> > Which shouldn't be true. There is no fundamental reason why
> > ordinary writes should be slower than O_DIRECT.
> > 
> Other than the copy to buffer taking CPU and memory resources.

It is not required by any standard that I know of. The kernel can be smarter
and avoid that copy when it can.
--
vda


Re: O_DIRECT question

2007-01-26 Thread Denis Vlasenko
On Friday 26 January 2007 18:05, Phillip Susi wrote:
> Denis Vlasenko wrote:
> > Which shouldn't be true. There is no fundamental reason why
> > ordinary writes should be slower than O_DIRECT.
> 
> Again, there IS a reason:  O_DIRECT eliminates the cpu overhead of the 
> kernel-user copy,

You assume that ordinary read()/write() is *required* to do the copying.
It isn't. The kernel is allowed to DMA directly to/from the user buffer here too.

> and when coupled with multithreading or aio, allows  
> the IO queues to be kept full with useful transfers at all times.

Again, ordinary I/O is no different. Especially on fds opened with O_SYNC,
write() will behave very similarly to an O_DIRECT one - data is guaranteed
to hit the disk before write() returns.

> Normal read/write requires the kernel to buffer and guess access

No it doesn't *require* that.

> patterns correctly to perform read ahead and write behind perfectly to 
> keep the queues full.  In practice, this does not happen perfectly all 
> of the time, or even most of the time, so it slows things down.

So let's fix the kernel for everyone's benefit instead of "give us
an API specifically for our needs".
--
vda


Re: O_DIRECT question

2007-01-25 Thread Denis Vlasenko
On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> Phillip Susi wrote:
> > Denis Vlasenko wrote:
> >> You mean "You can use aio_write" ?
> > 
> > Exactly.  You generally don't use O_DIRECT without aio.  Combining the
> > two is what gives the big win.
> 
> Well, it's not only aio.  Multithreaded I/O also helps alot -- all this,
> say, to utilize a raid array with many spindles.
> 
> But even single-threaded I/O but in large quantities benefits from O_DIRECT
> significantly, and I pointed this out before.

Which shouldn't be true. There is no fundamental reason why
ordinary writes should be slower than O_DIRECT.
--
vda


Re: O_DIRECT question

2007-01-25 Thread Denis Vlasenko
On Thursday 25 January 2007 20:28, Phillip Susi wrote:
> > Ahhh shit, are you saying that fdatasync will wait until writes
> > *by all other processes* to this file hit the disk?
> > Is that true?
> 
> I think all processes yes, but certainly all writes to this file by this 
> process.  That means you have to sync for every write, which means you 
> block.  Blocking stalls the pipeline.

I don't understand you here. Suppose fdatasync() is "do not return until
all cached writes to this file *done by the current process* hit the disk
(i.e. cached write data from other concurrent processes is not waited for),
report success or error code". Then

write(fd_O_DIRECT, buf, sz) - will wait until buf's data hit the disk

write(fd, buf, sz) - potentially will return sooner, but
fdatasync(fd) - will wait until buf's data hit the disk

Looks same to me.

> > If you opened a file and are doing only O_DIRECT writes, you
> > *always* have your written data flushed, by each write().
> > How is it different from writes done using
> > "normal" write() + fdatasync() pairs?
> 
> Because you can do writes async, but not fdatasync ( unless there is an 
> async version I don't know about ).

You mean "You can use aio_write" ?
--
vda


Re: O_DIRECT question

2007-01-25 Thread Denis Vlasenko
On Thursday 25 January 2007 16:44, Phillip Susi wrote:
> Denis Vlasenko wrote:
> > I will still disagree on this point (on point "use O_DIRECT, it's faster").
> > There is no reason why O_DIRECT should be faster than "normal" read/write
> > to large, aligned buffer. If O_DIRECT is faster on today's kernel,
> > then Linux' read()/write() can be optimized more.
> 
> Ahh but there IS a reason for it to be faster: the application knows 
> what data it will require, so it should tell the kernel rather than ask 
> it to guess.  Even if you had the kernel playing vmsplice games to get 
> avoid the copy to user space ( which still has a fair amount of overhead 
> ), then you still have the problem of the kernel having to guess what 
> data the application will require next, and try to fetch it early.  Then 
> when the application requests the data, if it is not already in memory, 
> the application blocks until it is, and blocking stalls the pipeline.
> 
> > (I hoped that they can be made even *faster* than O_DIRECT, but as I said,
> > you convinced me with your "error reporting" argument that reads must still
> > block until entire buffer is read. Writes can avoid that - apps can do
> > fdatasync/whatever to make sync writes & error checks if they want).
> 
> 
> fdatasync() is not acceptable either because it flushes the entire file.

If you opened a file and are doing only O_DIRECT writes, you
*always* have your written data flushed, by each write().
How is it different from writes done using
"normal" write() + fdatasync() pairs?

>   This does not allow the application to control the ordering of various 
> writes unless it limits itself to a single write/fdatasync pair at a 
> time.  Further, fdatasync again blocks the application.

Ahhh shit, are you saying that fdatasync will wait until writes
*by all other processes* to this file hit the disk?
Is that true?

--
vda


Re: O_DIRECT question

2007-01-24 Thread Denis Vlasenko
On Monday 22 January 2007 17:17, Phillip Susi wrote:
> > You do not need to know which read() exactly failed due to bad disk.
> > Filename and offset from the start is enough. Right?
> > 
> > So, SIGIO/SIGBUS can provide that, and if your handler is of
> > void (*sa_sigaction)(int, siginfo_t *, void *);
> > style, you can get fd, memory address of the fault, etc.
> > Probably kernel can even pass file offset somewhere in siginfo_t...
> 
> Sure... now what does your signal handler have to do in order to handle 
> this error in such a way as to allow the one request to be failed and 
> the task to continue handling other requests?  I don't think this is 
> even possible, yet alone clean.

Actually, you have convinced me on this. While it is possible
to report the error to userspace, it will be highly nontrivial (read:
bug-prone) for userspace to catch and act on the errors.

> > You think "Oracle". But this application may very well be
> > not Oracle, but diff, or dd, or KMail. I don't want to care.
> > I want all big writes to be efficient, not just those done by Oracle.
> > *Including* single threaded ones.
> 
> Then redesign those applications to use aio and O_DIRECT.  Incidentally 
> I have hacked up dd to do just that and have some very nice performance 
> numbers as a result.

I will still disagree on this point (on point "use O_DIRECT, it's faster").
There is no reason why O_DIRECT should be faster than "normal" read/write
to large, aligned buffer. If O_DIRECT is faster on today's kernel,
then Linux' read()/write() can be optimized more.

(I hoped that they can be made even *faster* than O_DIRECT, but as I said,
you convinced me with your "error reporting" argument that reads must still
block until entire buffer is read. Writes can avoid that - apps can do
fdatasync/whatever to make sync writes & error checks if they want).
--
vda


Re: O_DIRECT question

2007-01-21 Thread Denis Vlasenko
On Sunday 21 January 2007 13:09, Michael Tokarev wrote:
> Denis Vlasenko wrote:
> > On Saturday 20 January 2007 21:55, Michael Tokarev wrote:
> >> Denis Vlasenko wrote:
> >>> On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> >>>> example, which isn't quite possible now from userspace.  But as long as
> >>>> O_DIRECT actually writes data before returning from write() call (as it
> >>>> seems to be the case at least with a normal filesystem on a real block
> >>>> device - I don't touch corner cases like nfs here), it's pretty much
> >>>> THE ideal solution, at least from the application (developer) standpoint.
> >>> Why do you want to wait while 100 megs of data are being written?
> >>> You _have to_ have threaded db code in order to not waste
> >>> gobs of CPU time on UP + even with that you eat context switch
> >>> penalty anyway.
> >> Usually it's done using aio ;)
> >>
> >> It's not that simple really.
> >>
> >> For reads, you have to wait for the data anyway before doing something
> >> with it.  Omitting reads for now.
> > 
> > Really? All 100 megs _at once_? Linus described fairly simple (conceptually)
> > idea here: http://lkml.org/lkml/2002/5/11/58
> > In short, page-aligned read buffer can be just unmapped,
> > with page fault handler catching accesses to yet-unread data.
> > As data comes from disk, it gets mapped back in process'
> > address space.
> 
> > This way read() returns almost immediately and CPU is free to do
> > something useful.
> 
> And what the application does during that page fault?  Waits for the read
> to actually complete?  How it's different from a regular (direct or not)
> read?

The difference is that you block exactly when you try to access
data which is not there yet, rather than sooner (potentially much sooner).

If an application (e.g. a database) needs to know whether data is _really_ there,
it should use aio_read (or something better, something which doesn't use
signals.
Do we have this 'something'? I honestly don't know).

In some cases, even this is not needed because you don't have any other
things to do, so you just do read() (which returns early), and chew on
the data. If your CPU is fast enough and processing of the data is light enough
that it outruns the disk - big deal, you block in the page fault handler
whenever a page is not read for you in time.
If the CPU isn't fast enough, your CPU and disk subsystem are nicely working
in parallel.

With O_DIRECT, you alternate:
"CPU is idle, disk is working" / "CPU is working, disk is idle".

> Well, it IS different: now we can't predict *when* exactly we'll sleep waiting
> for the read to complete.  And also, now we're in an unknown-corner-case when
> an I/O error occurs, too (I/O errors interact badly with things like mmap, and
> this looks more like mmap than like actual read).
> 
> Yes, this way we'll fix the problems in current O_DIRECT way of doing things -
> all those rases and design stupidity etc.  Yes it may work, provided those
> "corner cases" like I/O errors problems will be fixed.

What do you want to do on I/O error? I guess you cannot do much -
any sensible db will shutdown itself. When your data storage
starts to fail, it's pointless to continue running.

You do not need to know which read() exactly failed due to bad disk.
Filename and offset from the start is enough. Right?

So, SIGIO/SIGBUS can provide that, and if your handler is of
void (*sa_sigaction)(int, siginfo_t *, void *);
style, you can get fd, memory address of the fault, etc.
Probably kernel can even pass file offset somewhere in siginfo_t...

> And yes, sometimes 
> it's not really that interesting to know when exactly we'll sleep actually
> waiting for the I/O - during read or during some memory access...

It differs from a performance perspective, as discussed above.

> There may be other reasons to "want" those extra context switches.
> I mentioned above that oracle doesn't use threads, but processes.

You can still be multithreaded. The point is, with O_DIRECT
you are _forced_ to be multithreaded, or else performance will suck.

> > Assume that we have "clever writes" like Linus described.
> > 
> > /* something like "caching i/o over this fd is mostly useless" */
> > /* (looks like this API is easier to transition to
> >  * than fadvise etc. - it's "looks like" O_DIRECT) */
> > fd = open(..., flags|O_STREAM);
> > ...
> > /* Starts writeout immediately due to O_STREAM,
> >  * marks buf100meg's pages R/O to catch modifications,
> >  * but doesn't block! */
> > write(fd, buf100meg, 100*1024*1024);
> 
> And how do we know when the write completes?
> 
> > /* We are free to do something useful in parallel */
> > sort();
> 
> .. which is done in another process, already started.

You think "Oracle". But this application may very well be
not Oracle, but diff, or dd, or KMail. I don't want to care.
I want all big writes to be efficient, not just those done by Oracle.
*Including* single threaded ones.

> > Why did we bother to write Linux at all?
> > There were other Unixes which worked ok.
> 
> Denis, please realize - I'm not an oracle guy (or database guy or
> whatever).
> I'm not really


Re: O_DIRECT question

2007-01-20 Thread Denis Vlasenko
On Saturday 20 January 2007 21:55, Michael Tokarev wrote:
> Denis Vlasenko wrote:
> > On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> >> example, which isn't quite possible now from userspace.  But as long as
> >> O_DIRECT actually writes data before returning from write() call (as it
> >> seems to be the case at least with a normal filesystem on a real block
> >> device - I don't touch corner cases like nfs here), it's pretty much
> >> THE ideal solution, at least from the application (developer) standpoint.
> > 
> > Why do you want to wait while 100 megs of data are being written?
> > You _have to_ have threaded db code in order to not waste
> > gobs of CPU time on UP + even with that you eat context switch
> > penalty anyway.
> 
> Usually it's done using aio ;)
> 
> It's not that simple really.
> 
> For reads, you have to wait for the data anyway before doing something
> with it.  Omiting reads for now.

Really? All 100 megs _at once_? Linus described fairly simple (conceptually)
idea here: http://lkml.org/lkml/2002/5/11/58
In short, page-aligned read buffer can be just unmapped,
with page fault handler catching accesses to yet-unread data.
As data comes from disk, it gets mapped back in process'
address space.

This way read() returns almost immediately and CPU is free to do
something useful.

> For writes, it's not that problematic - even 10-15 threads is nothing
> compared with the I/O (O in this case) itself -- that context switch
> penalty.

Well, if you have some CPU intensive thing to do (e.g. sort),
why not benefit from lack of extra context switch?
Assume that we have "clever writes" like Linus described.

/* something like "caching i/o over this fd is mostly useless" */
/* (looks like this API is easier to transition to
 * than fadvise etc. - it's "looks like" O_DIRECT) */
fd = open(..., flags|O_STREAM);
...
/* Starts writeout immediately due to O_STREAM,
 * marks buf100meg's pages R/O to catch modifications,
 * but doesn't block! */
write(fd, buf100meg, 100*1024*1024);
/* We are free to do something useful in parallel */
sort();

> > I hope you agree that threaded code is not ideal performance-wise
> > - async IO is better. O_DIRECT is strictly sync IO.
> 
> Hmm.. Now I'm confused.
> 
> For example, oracle uses aio + O_DIRECT.  It seems to be working... ;)
> As an alternative, there are multiple single-threaded db_writer processes.
> Why do you say O_DIRECT is strictly sync?

I mean that an O_DIRECT write() blocks until the I/O really is done.
A normal write can block for much less time, or not at all.

> In either case - I provided some real numbers in this thread before.
> Yes, O_DIRECT has its problems, even security problems.  But the thing
> is - it is working, and working WAY better - from the performance point
> of view - than "indirect" I/O, and currently there's no alternative that
> works as good as O_DIRECT.

Why did we bother to write Linux at all?
There were other Unixes which worked ok.
--
vda


Re: O_DIRECT question

2007-01-20 Thread Denis Vlasenko
On Sunday 14 January 2007 10:11, Nate Diller wrote:
> On 1/12/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> Most applications don't get the kind of performance analysis that
> Digeo was doing, and even then, it's rather lucky that we caught that.
>  So I personally think it'd be best for libc or something to simulate
> the O_STREAM behavior if you ask for it.  That would simplify things
> for the most common case, and have the side benefit of reducing the
> amount of extra code an application would need in order to take
> advantage of that feature.

Sounds like you are saying that making O_DIRECT really mean
O_STREAM will work for everybody (including db people,
except that they will moan a lot about "it isn't _real_ O_DIRECT!!!
Linux suxxx"). I don't care about that.
--
vda


Re: O_DIRECT question

2007-01-20 Thread Denis Vlasenko
On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> example, which isn't quite possible now from userspace.  But as long as
> O_DIRECT actually writes data before returning from write() call (as it
> seems to be the case at least with a normal filesystem on a real block
> device - I don't touch corner cases like nfs here), it's pretty much
> THE ideal solution, at least from the application (developer) standpoint.

Why do you want to wait while 100 megs of data are being written?
You _have to_ have threaded db code in order not to waste
gobs of CPU time on UP, and even then you eat the context switch
penalty anyway.

I hope you agree that threaded code is not ideal performance-wise
- async IO is better. O_DIRECT is strictly sync IO.
--
vda


Re: O_DIRECT question

2007-01-20 Thread Denis Vlasenko
On Thursday 11 January 2007 16:50, Linus Torvalds wrote:
> 
> On Thu, 11 Jan 2007, Nick Piggin wrote:
> > 
> > Speaking of which, why did we obsolete raw devices? And/or why not just
> > go with a minimal O_DIRECT on block device support? Not a rhetorical
> > question -- I wasn't involved in the discussions when they happened, so
> > I would be interested.
> 
> Lots of people want to put their databases in a file. Partitions really 
> weren't nearly flexible enough. So the whole raw device or O_DIRECT just 
> to the block device thing isn't really helping any.
> 
> > O_DIRECT is still crazily racy versus pagecache operations.
> 
> Yes. O_DIRECT is really fundamentally broken. There's just no way to fix 
> it sanely. Except by teaching people not to use it, and making the normal 
> paths fast enough (and that _includes_ doing things like dropping caches 
> more aggressively, but it probably would include more work on the device 
> queue merging stuff etc etc).

What will happen if we just make open ignore O_DIRECT? ;)

And then anyone who feels sad about it is advised to do it
as described here:

http://lkml.org/lkml/2002/5/11/58
--
vda




Re: Finding hardlinks

2007-01-11 Thread Denis Vlasenko
On Wednesday 03 January 2007 21:26, Frank van Maarseveen wrote:
> On Wed, Jan 03, 2007 at 08:31:32PM +0100, Mikulas Patocka wrote:
> > 64-bit inode numbers space is not yet implemented on Linux --- the problem 
> > is that if you return ino >= 2^32, programs compiled without 
> > -D_FILE_OFFSET_BITS=64 will fail with stat() returning -EOVERFLOW --- this 
> > failure is specified in POSIX, but not very useful.
> 
> hmm, checking iunique(), ino_t, __kernel_ino_t... I see. Pity. So at
> some point in time we may need a sort of "ino64" mount option to be
> able to switch to a 64 bit number space on mount basis. Or (conversely)
> refuse to mount without that option if we know there are >32 bit st_ino
> out there. And invent iunique64() and use that when "ino64" specified
> for FAT/SMB/...  when those filesystems haven't been replaced by a
> successor by that time.
> 
> At that time probably all programs are either compiled with
> -D_FILE_OFFSET_BITS=64 (most already are because of files bigger than 2G)
> or completely 64 bit. 

Good plan. Be prepared to redo it again when 64 bits starts to feel "small" too.
Then again when 128 bits is "small". Don't tell me this won't happen.
15 years ago people would have laughed at 32-bit inode numbers being not enough.
--
vda


Re: Finding hardlinks

2007-01-11 Thread Denis Vlasenko
On Wednesday 03 January 2007 13:42, Pavel Machek wrote:
> I guess that is the way to go. samefile(path1, path2) is unfortunately
> inherently racy.

Not a problem in practice. You don't expect cp -a
to reliably copy a tree which something else is modifying
at the same time.

Thus we assume that the tree we operate on is not modified.
--
vda


Re: Finding hardlinks

2007-01-11 Thread Denis Vlasenko
On Thursday 28 December 2006 10:06, Benny Halevy wrote:
> Mikulas Patocka wrote:
> >>> If user (or script) doesn't specify that flag, it doesn't help. I think
> >>> the best solution for these filesystems would be either to add new syscall
> >>>   int is_hardlink(char *filename1, char *filename2)
> >>> (but I know adding syscall bloat may be objectionable)
> >> it's also the wrong api; the filenames may have been changed under you
> >> just as you return from this call, so it really is a
> >> "was_hardlink_at_some_point()" as you specify it.
> >> If you make it work on fd's.. it has a chance at least.
> > 
> > Yes, but it doesn't matter --- if the tree changes under "cp -a" command, 
> > no one guarantees you what you get.
> > int fis_hardlink(int handle1, int handle 2);
> > Is another possibility but it can't detect hardlinked symlinks.

It also suffers from combinatorial explosion.
cp -a on 10^6 files will require ~0.5 * 10^12 compares...
 
> It seems like the posix idea of unique  doesn't
> hold water for modern file systems and that creates real problems for
> backup apps which rely on that to detect hard links.

Yes, and it should have been obvious at the 32->64bit inode# transition.
Unfortunately people tend to think "ok, NOW this new shiny BIGNUM-bit
field is big enough for everybody". Then the cycle repeats in five years...

I think the solution is that inode "numbers" should become
opaque _variable-length_ hashes. They are already just hash values,
this is nothing new. All problems stem from fixed width of inode# only.

--
vda


Re: PATCH - x86-64 signed-compare bug, was Re: select() setting ERESTARTNOHAND (514).

2007-01-11 Thread Denis Vlasenko
On Thursday 11 January 2007 02:02, Neil Brown wrote:
> If regs->rax is unsigned long, then I would think the compiler would
> be allowed to convert
> 
>switch (regs->rax) {
>   case -514 : whatever;
>}
> 
> to a no-op, as regs->rax will never have a negative value.

In C, you never actually compare different types. They are always
promoted to some common type first.

Both sides of the (implicit) == here get promoted to the "biggest" integer
type, in this case unsigned long. "-514" is an int, so it gets
sign-extended to the width of "long" and then converted to
unsigned long.
--
vda




Re: kernel + gcc 4.1 = several problems

2007-01-06 Thread Denis Vlasenko
On Thursday 04 January 2007 18:37, Linus Torvalds wrote:
> With 7+ million lines of C code and headers, I'm not interested in 
> compilers that read the letter of the law. We don't want some really 
> clever code generation that gets us .5% on some unrealistic load. We want 
> good _solid_ code generation that does the obvious thing.
> 
> Compiler writers seem to seldom even realize this. A lot of commercial 
> code gets shipped with basically no optimizations at all (or with specific 
> optimizations turned off), because people want to ship what they debug and 
> work with.

I'd say "care about obvious, safe optimizations which we still don't do".
I want this:

char v[4];
...
memcmp(v, "abcd", 4) == 0

compile to single cmpl on i386. This (gcc 4.1.1) is ridiculous:

.LC0:
.string "abcd"
.text
...
pushl   $4
pushl   $.LC0
pushl   $v
call    memcmp
addl    $12, %esp
testl   %eax, %eax

There are tons of examples where you can improve code generation.
--
vda




Re: open(O_DIRECT) on a tmpfs?

2007-01-05 Thread Denis Vlasenko
On Friday 05 January 2007 17:20, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > But O_DIRECT is _not_ about cache. At least I think it was not about
> > cache initially, it was more about DMAing data directly from/to
> > application address space to/from disks, saving memcpy's and double
> > allocations. Why do you think it has that special alignment requirements?
> > Are they cache related? Not at all!

> I'm not sure I can see how you find "don't use cache" not cache related. 
> Saving the resources needed for cache would seem to obviously leave them 
> for other processes.

I feel that the word "direct" has nothing to do with caching (or lack thereof).
"Direct" means that I want to avoid extra allocations and memcpy:

write(fd, hugebuf, 100*1024*1024);

Here application uses 100 megs for hugebuf, and if it is not sufficiently
aligned, even smartest kernel in this universe cannot DMA this data
to disk. No way. So it needs to allocate ANOTHER, aligned buffer,
memcpy the data (completely flushing L1 and L2 dcaches), and DMA it
from there. Thus we use twice as much RAM as we really need, and do
a lot of mostly pointless memory moves! And worse, application cannot
even detect it - it works, it's just slow and eats a lot of RAM and CPU.

That's where O_DIRECT helps. When app wants to avoid that, it opens fd
with O_DIRECT. App in effect says: "I *do* want to avoid extra shuffling,
because I will write huge amounts of data in big blocks."

> > But _conceptually_ "direct DMAing" and "do-not-cache-me"
> > are orthogonal, right?
>
> In the sense that you must do DMA or use cache, yes.

Let's say I implemented a heuristic in my cp command:
if source file is indeed a regular file and it is
larger than 128K, allocate aligned 128K buffer
and try to copy it using O_DIRECT i/o.

Then I use this "enhanced" cp command to copy a large directory
recursively, and then I run grep on that directory.

Can you explain why cp shouldn't cache the data it just wrote?
I *am* going to use it shortly thereafter!

> > That's why we also have bona fide fadvise and madvise
> > with FADV_DONTNEED/MADV_DONTNEED:
> >
> > http://www.die.net/doc/linux/man/man2/fadvise.2.html
> > http://www.die.net/doc/linux/man/man2/madvise.2.html
> >
> > _This_ is the proper way to say "do not cache me".
>
> But none of those advisories says how to cache or not, only what the 
> expected behavior will be. So FADV_NOREUSE does not control cache use, 
> it simply allows the system to make assumptions.

Exactly. If you don't need the data, just let the kernel know that.
When you use O_DIRECT, you are saying "I want direct DMA to disk without
extra copying". With fadvise(FADV_DONTNEED) you are saying
"do not expect access in the near future" == "do not try to optimize
for possible accesses in the near future" == "do not cache".

Again: with O_DIRECT:

write(fd, hugebuf, 100*1024*1024);

the kernel _has difficulty_ caching this data, simply because
the data isn't copied into kernel pages at all; and if the user
continues to use hugebuf after write(), the kernel simply cannot
cache that data - it doesn't _have_ the data.

But what if the user then unmaps hugebuf? Should the kernel
forget that the data in those pages is in effect cached data from
the file being written to? Not necessarily.

Four years ago Linus wrote an email about it:

http://lkml.org/lkml/2002/5/11/58

btw, as an Oracle DBA on my day job, I completely agree
with Linus on the "deranged monkey" comparison in that mail...
--
vda




Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Denis Vlasenko
On Thursday 04 January 2007 17:19, Bill Davidsen wrote:
> Hugh Dickins wrote:
> In many cases the use of O_DIRECT is purely to avoid impact on cache 
> used by other applications. An application which writes a large quantity 
> of data will have less impact on other applications by using O_DIRECT, 
> assuming that the data will not be read from cache due to application 
> pattern or the data being much larger than physical memory.

But O_DIRECT is _not_ about cache. At least I think it was not about
cache initially, it was more about DMAing data directly from/to
application address space to/from disks, saving memcpy's and double
allocations. Why do you think it has that special alignment requirements?
Are they cache related? Not at all!

After that people started adding unrelated semantics to it -
"oh, we use O_DIRECT in our database code and it pushes EVERYTHING
else out of cache. This is bad. Let's overload O_DIRECT to also mean
'do not pollute the cache'. Here's the patch".

DB people from certain well-known commercial DB have zero coding
taste. No wonder their binaries are nearly 100 MB (!!!) in size...

In all fairness, O_DIRECT's direct-DMA makes it easier to implement
"do-not-cache-me" than to do it for generic read()/write()
(just because O_DIRECT is (was?) using a different code path,
not integrated into the VM cache machinery that much).

But _conceptually_ "direct DMAing" and "do-not-cache-me"
are orthogonal, right?

That's why we also have bona fide fadvise and madvise
with FADV_DONTNEED/MADV_DONTNEED:

http://www.die.net/doc/linux/man/man2/fadvise.2.html
http://www.die.net/doc/linux/man/man2/madvise.2.html

_This_ is the proper way to say "do not cache me".

I think tmpfs should just ignore the O_DIRECT bit.
That won't require much coding.
--
vda


Re: kernel + gcc 4.1 = several problems

2007-01-03 Thread Denis Vlasenko
On Wednesday 03 January 2007 21:38, Linus Torvalds wrote:
> On Wed, 3 Jan 2007, Denis Vlasenko wrote:
> > 
> > Why don't CPU people internally convert cmov into a jmp,mov pair?
> 
...
> It really all boils down to: there's simply no real reason to use cmov. 
> It's not horrible either, so go ahead and use it if you want to, but don't 
> expect your code to really magically run any faster.

IOW: yet another slot in the instruction opcode matrix and thousands of
transistors in instruction decoders are wasted on this
"clever invention", eh?
--
vda


Re: kernel + gcc 4.1 = several problems

2007-01-03 Thread Denis Vlasenko
On Wednesday 03 January 2007 17:03, Linus Torvalds wrote:
> On Wed, 3 Jan 2007, Grzegorz Kulewski wrote:
> > Could you explain why CMOV is pointless now? Are there any benchmarks 
> > proving
> > that?
> 
> CMOV (and, more generically, any "predicated instruction") tends to 
> generally be a bad idea on an aggressively out-of-order CPU. It doesn't 
> always have to be horrible, but in practice it is seldom very nice, and 
> (as usual) on the P4 it can be really quite bad.
> 
> On a P4, I think a cmov basically takes 10 cycles.
> 
> But even ignoring the usual P4 "I suck at things that aren't totally 
> normal", cmov is actually not a great idea. You can always replace it by
> 
>   j<negated condition> forward
>   mov ..., %reg
>   forward:
...
...
> In contrast, if you use a predicated instruction, ALL of it is on the 
> critical path. Calculating the conditional is on the critical path. 
> Calculating the value that gets used is obviously ALSO on the critical 
> path, but so is the calculation for the value that DOESN'T get used too. 
> So the cmov - rather than speeding things up - actually slows things down, 
> because it makes more code be dependent on each other.

Why don't CPU people internally convert cmov into a jmp,mov pair?
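To make the trade-off concrete, here are the two shapes in C (what the compiler actually emits is its choice; the asm in the comments is what gcc -O2 on x86 commonly produces, not a guarantee):

```c
/* Branch form: cmp; jle .skip; mov -- only the side actually taken
 * ends up on the critical path, and the branch predictor can run
 * ahead speculatively. */
int max_branch(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

/* cmov form: cmp; cmovge -- the flags computation and BOTH inputs
 * become one dependency chain, which is the cost described above. */
int max_cmov(int a, int b)
{
    return a > b ? a : b;
}
```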
--
vda


Re: replace "memset(...,0,PAGE_SIZE)" calls with "clear_page()"?

2006-12-30 Thread Denis Vlasenko
On Saturday 30 December 2006 23:08, Robert P. J. Day wrote:
> >
> > clear_page assumes that the given address is page aligned, I think. It
> > may fail if you feed it a misaligned region's address.
> 
> i don't see how that can be true, given that most of the definitions
> of the clear_page() macro are simply invocations of memset().  see for
> yourself:
> 
>   $ grep -r "#define clear_page" include
> 
> my only point here was that lots of code seems to be calling memset()
> when it would be clearer to invoke clear_page().  but there's still
> something a bit curious happening here.  i'll poke around a bit more
> before i ask, though.

There are MMX implementations of clear_page().

I was experimenting with an SSE[2] clear_page() which uses
non-temporal stores. That one requires 16-byte alignment.

BTW, it ran ~300% faster than memset. But Andi Kleen
insists that the cache eviction caused by NT stores will make it
slower in macrobenchmarks.

Apart from a fairly extensive set of microbenchmarks
I tested kernel compiles (i.e. "real world load")
and they are FASTER too, not slower, but Andi
is fairly entrenched in his opinion ;)
I gave up.
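A sketch of the kind of clear_page() under discussion (names are mine; guarded so it falls back to memset() where SSE2 is unavailable). The 16-byte alignment requirement comes from MOVNTDQ itself:

```c
#include <stddef.h>
#include <string.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

#define SKETCH_PAGE_SIZE 4096

/* Zero one page with non-temporal stores, bypassing the cache.
 * Assumes 'page' is at least 16-byte aligned; _mm_stream_si128()
 * (MOVNTDQ) faults on unaligned addresses. */
static void clear_page_nt(void *page)
{
#ifdef __SSE2__
    __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < SKETCH_PAGE_SIZE; i += 16)
        _mm_stream_si128((__m128i *)((char *)page + i), zero);
    _mm_sfence();               /* order the NT stores before returning */
#else
    memset(page, 0, SKETCH_PAGE_SIZE);
#endif
}
```

The cache-bypass is exactly the double-edged part of the argument: the cleared page is not cache-hot for whoever touches it next.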
--
vda


Re: [RFC,PATCHSET] Managed device resources

2006-12-30 Thread Denis Vlasenko
On Tuesday 26 December 2006 16:18, Tejun Heo wrote:
> Hello, all.
> 
> This patchset implements managed device resources, in short, devres.

I was working on a Linux device driver; indeed, those error paths
are notoriously prone to bugs.

The patchset looks like a good idea to me.
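For readers who haven't written one: the unwind pattern devres is meant to replace looks like this (a user-space sketch; malloc() stands in for request_irq()/ioremap(), and the names are illustrative):

```c
#include <stdlib.h>

/* Classic probe()-style error handling: every acquisition needs a
 * goto label, released in exactly reverse order.  Getting one label
 * wrong leaks or double-frees -- the bug class devres removes. */
static int probe_sketch(void **mem_out, void **irq_out)
{
    void *mem, *irq;

    mem = malloc(64);           /* stand-in for ioremap() */
    if (!mem)
        goto err_mem;
    irq = malloc(32);           /* stand-in for request_irq() */
    if (!irq)
        goto err_irq;
    *mem_out = mem;
    *irq_out = irq;
    return 0;

err_irq:
    free(mem);
err_mem:
    return -1;
}
```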
--
vda


  1   2   3   4   >