Re: [PATCH] ext4: fix uniniatilized extend splitting error.

2008-01-11 Thread Amit K. Arora
On Thu, Jan 10, 2008 at 02:31:03PM -0700, Andreas Dilger wrote:
 On Jan 10, 2008  17:31 +0300, Dmitry Monakhov wrote:
  While playing with new fancy fallocate interface on ext4 i've triggered
  bug which corrupted my grub :).

 I notice I'm CC'd on this, but in fact Amit wrote the code.  I've CC'd
 him even though I expect he will notice it anyways.

Andreas, thanks for adding me to the CC list!
 
  My testcase:
  
  blksize = 0x1000;
  fd = open(argv[1], O_RDWR|O_CREAT, 0700);
  unsigned long long sz = 0x1000UL;
  /* allocating big blocks chunk */
  syscall(__NR_fallocate, fd, 0, 0UL, sz)
  
  /* grab all other available filesystem space */
  tfd = open(tmp, O_RDWR|O_CREAT|O_DIRECT, 0700);
  while( write(tfd, buf, 4096)  0); /* loop untill ENOSPC */
  fsync(fd); /* just in case */
  while (pos  sz) {
  /* each seek+ write operation result in splits uninitialized extent
  in three extents. Splitting may result in new extent allocation
  which probably will fail because of ENOSPC*/
  
  lseek(fd, blksize*2 -1, SEEK_CUR);
  if ((ret = write(fd, 'a', 1)) != 1)
  exit(1);
  pos += blksize * 2;
  }
 
 Interesting test, and well thought out...

Dmitry, Good catch and thanks for the patch below !
Please add Acked-by: Amit Arora [EMAIL PROTECTED].
 
 The other item that Amit and I discussed in the past is in the case of
 ENOSPC it would be possible instead of splitting the extent to zero-fill
 the smaller extent (1 block in your test case) and write the whole thing
 as an initialized extent.  This could then either be merged with the
 previous or following allocated extent, or the whole extent zeroed if that
 was not possible.

Yes, this is one of the things pending..
 
 It would add some latency in the worst case to do this in the kernel,
 but this would only happen if there is no free space at all.  It might
 even be desirable to always zero-fill small extents instead of splitting
 uninitialized extents, because a random write of 64kB is not more expensive
 than 4kB and avoids overhead of splitting the nicely contiguous extent tree.

I feel this is debatable and it may not be easy to define what extent size is
small enough. Anyhow, since we merge the extents when possible it should
not be too bad, unless someone deliberately writes to alternate blocks
in the uninitialized extent. Hence, as Mingming suggested, I too think that we
should be doing it only when necessary.

--
Regards,
Amit Arora
 
  Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED]
  ---
   fs/ext4/extents.c |5 +++--
   1 files changed, 3 insertions(+), 2 deletions(-)
  
  diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
  index 8528774..fc8e508 100644
  --- a/fs/ext4/extents.c
  +++ b/fs/ext4/extents.c
  @@ -2320,9 +2320,10 @@ int ext4_ext_get_blocks(handle_t *handle, struct 
  inode *inode,
  ret = ext4_ext_convert_to_initialized(handle, inode,
  path, iblock,
  max_blocks);
  -   if (ret = 0)
  +   if (ret = 0) {
  +   err = ret;
  goto out2;
  -   else
  +   } else
  allocated = ret;
  goto outnew;
  }
  -- 
  1.5.3.1.40.g6972-dirty
  
  
  -
  To unsubscribe from this list: send the line unsubscribe linux-ext4 in
  the body of a message to [EMAIL PROTECTED]
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 Cheers, Andreas
 --
 Andreas Dilger
 Sr. Staff Engineer, Lustre Group
 Sun Microsystems of Canada, Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [Coding Style]: fs/ext{3,4}/ext{3,4}_jbd{,2}.c

2008-01-11 Thread Paul Mundt
On Fri, Jan 11, 2008 at 12:04:14PM +0100, Roel Kluin wrote:
 Paul Mundt wrote:
  Take a look at how CONFIG_PCMCIA_DEBUG is handled.
 
 In drivers/pcmcia/Makefile, when CONFIG_PCMCIA_DEBUG=y, it  gives
 EXTRA_CFLAGS += -DDEBUG
 which causes the definition of DEBUG as a macro, with definition 1.
 
  With DEBUG()-pr_debug() conversion here you've silently dropped the
  __func__ prefixing. Note that dev_dbg() is usually preferred when you can
  get a hold of a struct device pointer, as it takes care of prettifying
  the output with the driver name and so on, rather than the convention of
  adding a prefix. If you can't get at the struct device pointer, you'll
  probably just want to insert the __func__ prefixing manually at the
  callsites.
 
 Ah, ok, then this should be right:
 --
 Replace printk wrapper - with a syntax error - by pr_debug.
 DEBUG is defined 1 when CONFIG_PCMCIA_DEBUG is set.
 
 Signed-off-by: Roel Kluin [EMAIL PROTECTED]
 ---
 diff --git a/drivers/pcmcia/au1000_xxs1500.c b/drivers/pcmcia/au1000_xxs1500.c
 index ce9d5c4..8e6426b 100644
 --- a/drivers/pcmcia/au1000_xxs1500.c
 +++ b/drivers/pcmcia/au1000_xxs1500.c
 @@ -55,12 +55,6 @@
  #define PCMCIA_NUM_SOCKS (PCMCIA_MAX_SOCK + 1)
  #define PCMCIA_IRQ   AU1000_GPIO_4
  
 -#if 0
 -#define DEBUG(x,args...) printk(__FUNCTION__ :  x,##args)
 -#else
 -#define DEBUG(x,args...)
 -#endif
 -
  static int xxs1500_pcmcia_init(struct pcmcia_init *init)
  {
   return PCMCIA_NUM_SOCKS;
 @@ -143,13 +137,13 @@ xxs1500_pcmcia_configure_socket(const struct 
 pcmcia_configure *configure)
  
   if(configure-sock  PCMCIA_MAX_SOCK) return -1;
  
 - DEBUG(Vcc %dV Vpp %dV, reset %d\n,
 + pr_debug(Vcc %dV Vpp %dV, reset %d\n,
   configure-vcc, configure-vpp, configure-reset);
  
You're still changing the semantics here. The DEBUG() does __FUNCTION__
prefixing, while pr_debug() does not. This should be something like
pr_debug(%s: , __func__, ...); instead, if you want to maintain
consistency. Beyond that, this looks fine, yes.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [Coding Style]: fs/ext{3,4}/ext{3,4}_jbd{,2}.c

2008-01-11 Thread Roel Kluin
Paul Mundt wrote:
 On Fri, Jan 11, 2008 at 10:45:30AM +0100, Roel Kluin wrote:
 Paul Mundt wrote:
 On Fri, Jan 11, 2008 at 04:09:45AM +0100, Peter Stuge wrote:
 On Thu, Jan 10, 2008 at 10:03:58PM +0100, Roel Kluin wrote:
 -#define DEBUG(x,args...) printk(__FUNCTION__ :  x,##args)
 +#define DEBUG(x, args...)printk(%s: , __func__, x, ##args)
 Can this really be expected to work when x contains conversions?

 How about:

 #define DEBUG(x, args...) printk(%s:  x, __func__, ##args)

 How about throwing out hand-rolled debug printk wrappers for the
 brain-damage they are and using the ones the kernel provides instead?

 Should it be done like this?
 --
 Replace printk wrapper - with a syntax error - by pr_debug

 Signed-off-by: Roel Kluin [EMAIL PROTECTED]
 
 Close. But in this case #define DEBUG is already instrumented by the
 subsystem-wide debug option if it's enabled, so it's preferable to use
 that and just drop the special debug Kconfig option completely.
 
 Take a look at how CONFIG_PCMCIA_DEBUG is handled.

In drivers/pcmcia/Makefile, when CONFIG_PCMCIA_DEBUG=y, it  gives
EXTRA_CFLAGS += -DDEBUG
which causes the definition of DEBUG as a macro, with definition 1.

 With DEBUG()-pr_debug() conversion here you've silently dropped the
 __func__ prefixing. Note that dev_dbg() is usually preferred when you can
 get a hold of a struct device pointer, as it takes care of prettifying
 the output with the driver name and so on, rather than the convention of
 adding a prefix. If you can't get at the struct device pointer, you'll
 probably just want to insert the __func__ prefixing manually at the
 callsites.

Ah, ok, then this should be right:
--
Replace printk wrapper - with a syntax error - by pr_debug.
DEBUG is defined 1 when CONFIG_PCMCIA_DEBUG is set.

Signed-off-by: Roel Kluin [EMAIL PROTECTED]
---
diff --git a/drivers/pcmcia/au1000_xxs1500.c b/drivers/pcmcia/au1000_xxs1500.c
index ce9d5c4..8e6426b 100644
--- a/drivers/pcmcia/au1000_xxs1500.c
+++ b/drivers/pcmcia/au1000_xxs1500.c
@@ -55,12 +55,6 @@
 #define PCMCIA_NUM_SOCKS   (PCMCIA_MAX_SOCK + 1)
 #define PCMCIA_IRQ AU1000_GPIO_4
 
-#if 0
-#define DEBUG(x,args...)   printk(__FUNCTION__ :  x,##args)
-#else
-#define DEBUG(x,args...)
-#endif
-
 static int xxs1500_pcmcia_init(struct pcmcia_init *init)
 {
return PCMCIA_NUM_SOCKS;
@@ -143,13 +137,13 @@ xxs1500_pcmcia_configure_socket(const struct 
pcmcia_configure *configure)
 
if(configure-sock  PCMCIA_MAX_SOCK) return -1;
 
-   DEBUG(Vcc %dV Vpp %dV, reset %d\n,
+   pr_debug(Vcc %dV Vpp %dV, reset %d\n,
configure-vcc, configure-vpp, configure-reset);
 
switch(configure-vcc){
case 33: /* Vcc 3.3V */
/* turn on power */
-   DEBUG(turn on power\n);
+   pr_debug(turn on power\n);
au_writel((au_readl(GPIO2_PINSTATE)  ~(114))|(130),
GPIO2_OUTPUT);
au_sync_delay(100);
@@ -166,7 +160,7 @@ xxs1500_pcmcia_configure_socket(const struct 
pcmcia_configure *configure)
}
 
if (!configure-reset) {
-   DEBUG(deassert reset\n);
+   pr_debug(deassert reset\n);
au_writel((au_readl(GPIO2_PINSTATE)  ~(14))|(120),
GPIO2_OUTPUT);
au_sync_delay(100);
@@ -174,7 +168,7 @@ xxs1500_pcmcia_configure_socket(const struct 
pcmcia_configure *configure)
GPIO2_OUTPUT);
}
else {
-   DEBUG(assert reset\n);
+   pr_debug(assert reset\n);
au_writel(au_readl(GPIO2_PINSTATE) | (14)|(120),
GPIO2_OUTPUT);
}
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [Coding Style]: fs/ext{3,4}/ext{3,4}_jbd{,2}.c

2008-01-11 Thread Paul Mundt
On Fri, Jan 11, 2008 at 10:45:30AM +0100, Roel Kluin wrote:
 Paul Mundt wrote:
  On Fri, Jan 11, 2008 at 04:09:45AM +0100, Peter Stuge wrote:
  On Thu, Jan 10, 2008 at 10:03:58PM +0100, Roel Kluin wrote:
  -#define DEBUG(x,args...) printk(__FUNCTION__ :  x,##args)
  +#define DEBUG(x, args...)printk(%s: , __func__, x, ##args)
  Can this really be expected to work when x contains conversions?
 
  How about:
 
  #define DEBUG(x, args...) printk(%s:  x, __func__, ##args)
 
  How about throwing out hand-rolled debug printk wrappers for the
  brain-damage they are and using the ones the kernel provides instead?
  
 Should it be done like this?
 --
 Replace printk wrapper - with a syntax error - by pr_debug
 
 Signed-off-by: Roel Kluin [EMAIL PROTECTED]

Close. But in this case #define DEBUG is already instrumented by the
subsystem-wide debug option if it's enabled, so it's preferable to use
that and just drop the special debug Kconfig option completely.

Take a look at how CONFIG_PCMCIA_DEBUG is handled.

With DEBUG()-pr_debug() conversion here you've silently dropped the
__func__ prefixing. Note that dev_dbg() is usually preferred when you can
get a hold of a struct device pointer, as it takes care of prettifying
the output with the driver name and so on, rather than the convention of
adding a prefix. If you can't get at the struct device pointer, you'll
probably just want to insert the __func__ prefixing manually at the
callsites.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [Coding Style]: fs/ext{3,4}/ext{3,4}_jbd{,2}.c

2008-01-11 Thread Roel Kluin
Paul Mundt wrote:
 On Fri, Jan 11, 2008 at 04:09:45AM +0100, Peter Stuge wrote:
 On Thu, Jan 10, 2008 at 10:03:58PM +0100, Roel Kluin wrote:
 -#define DEBUG(x,args...)   printk(__FUNCTION__ :  x,##args)
 +#define DEBUG(x, args...)  printk(%s: , __func__, x, ##args)
 Can this really be expected to work when x contains conversions?

 How about:

 #define DEBUG(x, args...) printk(%s:  x, __func__, ##args)

 How about throwing out hand-rolled debug printk wrappers for the
 brain-damage they are and using the ones the kernel provides instead?
 
Should it be done like this?
--
Replace printk wrapper - with a syntax error - by pr_debug

Signed-off-by: Roel Kluin [EMAIL PROTECTED]
---
diff --git a/drivers/pcmcia/au1000_xxs1500.c b/drivers/pcmcia/au1000_xxs1500.c
index ce9d5c4..4c32389 100644
--- a/drivers/pcmcia/au1000_xxs1500.c
+++ b/drivers/pcmcia/au1000_xxs1500.c
@@ -25,6 +25,10 @@
  *
  *
  */
+#ifdef CONFIG_MIPS_XXS1500_DEBUG
+#define DEBUG 1
+#endif
+
 #include linux/module.h
 #include linux/init.h
 #include linux/delay.h
@@ -55,11 +59,6 @@
 #define PCMCIA_NUM_SOCKS   (PCMCIA_MAX_SOCK + 1)
 #define PCMCIA_IRQ AU1000_GPIO_4
 
-#if 0
-#define DEBUG(x,args...)   printk(__FUNCTION__ :  x,##args)
-#else
-#define DEBUG(x,args...)
-#endif
 
 static int xxs1500_pcmcia_init(struct pcmcia_init *init)
 {
@@ -143,13 +142,13 @@ xxs1500_pcmcia_configure_socket(const struct 
pcmcia_configure *configure)
 
if(configure-sock  PCMCIA_MAX_SOCK) return -1;
 
-   DEBUG(Vcc %dV Vpp %dV, reset %d\n,
+   pr_debug(Vcc %dV Vpp %dV, reset %d\n,
configure-vcc, configure-vpp, configure-reset);
 
switch(configure-vcc){
case 33: /* Vcc 3.3V */
/* turn on power */
-   DEBUG(turn on power\n);
+   pr_debug(turn on power\n);
au_writel((au_readl(GPIO2_PINSTATE)  ~(114))|(130),
GPIO2_OUTPUT);
au_sync_delay(100);
@@ -166,7 +165,7 @@ xxs1500_pcmcia_configure_socket(const struct 
pcmcia_configure *configure)
}
 
if (!configure-reset) {
-   DEBUG(deassert reset\n);
+   pr_debug(deassert reset\n);
au_writel((au_readl(GPIO2_PINSTATE)  ~(14))|(120),
GPIO2_OUTPUT);
au_sync_delay(100);
@@ -174,7 +173,7 @@ xxs1500_pcmcia_configure_socket(const struct 
pcmcia_configure *configure)
GPIO2_OUTPUT);
}
else {
-   DEBUG(assert reset\n);
+   pr_debug(assert reset\n);
au_writel(au_readl(GPIO2_PINSTATE) | (14)|(120),
GPIO2_OUTPUT);
}

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [Coding Style]: fs/ext{3,4}/ext{3,4}_jbd{,2}.c

2008-01-11 Thread Roel Kluin
Paul Mundt wrote:
 On Fri, Jan 11, 2008 at 12:04:14PM +0100, Roel Kluin wrote:
 Paul Mundt wrote:
 Take a look at how CONFIG_PCMCIA_DEBUG is handled.
 In drivers/pcmcia/Makefile, when CONFIG_PCMCIA_DEBUG=y, it  gives
 EXTRA_CFLAGS += -DDEBUG
 which causes the definition of DEBUG as a macro, with definition 1.

 With DEBUG()-pr_debug() conversion here you've silently dropped the
 __func__ prefixing. Note that dev_dbg() is usually preferred when you can
 get a hold of a struct device pointer, as it takes care of prettifying
 the output with the driver name and so on, rather than the convention of
 adding a prefix. If you can't get at the struct device pointer, you'll
 probably just want to insert the __func__ prefixing manually at the
 callsites.

 You're still changing the semantics here. The DEBUG() does __FUNCTION__
 prefixing, while pr_debug() does not. This should be something like
 pr_debug(%s: , __func__, ...); instead, if you want to maintain
 consistency. Beyond that, this looks fine, yes.

Somehow I overlooked your last point. Well, the original code had no
semantics - it wasn't working - but I get your point.

Below a patch to print the __func__'s. Note that all pr_debugs are in the
same function. Maybe the function name should be printed in the first
pr_debug only? If desired, I'll send another patch.
--
Replace printk wrapper - with a syntax error - by pr_debug; keep printing
the __func__'s. (DEBUG is defined 1 when CONFIG_PCMCIA_DEBUG is set)

Signed-off-by: Roel Kluin [EMAIL PROTECTED]
---
diff --git a/drivers/pcmcia/au1000_xxs1500.c b/drivers/pcmcia/au1000_xxs1500.c
index ce9d5c4..b4bad6e 100644
--- a/drivers/pcmcia/au1000_xxs1500.c
+++ b/drivers/pcmcia/au1000_xxs1500.c
@@ -55,12 +55,6 @@
 #define PCMCIA_NUM_SOCKS   (PCMCIA_MAX_SOCK + 1)
 #define PCMCIA_IRQ AU1000_GPIO_4
 
-#if 0
-#define DEBUG(x,args...)   printk(__FUNCTION__ :  x,##args)
-#else
-#define DEBUG(x,args...)
-#endif
-
 static int xxs1500_pcmcia_init(struct pcmcia_init *init)
 {
return PCMCIA_NUM_SOCKS;
@@ -143,13 +137,13 @@ xxs1500_pcmcia_configure_socket(const struct 
pcmcia_configure *configure)
 
if(configure-sock  PCMCIA_MAX_SOCK) return -1;
 
-   DEBUG(Vcc %dV Vpp %dV, reset %d\n,
+   pr_debug(%s: Vcc %dV Vpp %dV, reset %d\n, __func__,
configure-vcc, configure-vpp, configure-reset);
 
switch(configure-vcc){
case 33: /* Vcc 3.3V */
/* turn on power */
-   DEBUG(turn on power\n);
+   pr_debug(%s: turn on power\n, __func__);
au_writel((au_readl(GPIO2_PINSTATE)  ~(114))|(130),
GPIO2_OUTPUT);
au_sync_delay(100);
@@ -166,7 +160,7 @@ xxs1500_pcmcia_configure_socket(const struct 
pcmcia_configure *configure)
}
 
if (!configure-reset) {
-   DEBUG(deassert reset\n);
+   pr_debug(%s: deassert reset\n, __func__);
au_writel((au_readl(GPIO2_PINSTATE)  ~(14))|(120),
GPIO2_OUTPUT);
au_sync_delay(100);
@@ -174,7 +168,7 @@ xxs1500_pcmcia_configure_socket(const struct 
pcmcia_configure *configure)
GPIO2_OUTPUT);
}
else {
-   DEBUG(assert reset\n);
+   pr_debug(%s: assert reset\n, __func__);
au_writel(au_readl(GPIO2_PINSTATE) | (14)|(120),
GPIO2_OUTPUT);
}



-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problems with mballoc and uninit_groups option

2008-01-11 Thread Aneesh Kumar K.V
On Fri, Jan 11, 2008 at 03:04:08PM +0100, Valerie Clement wrote:
 Hi,

 I've got problems with mballoc when I create the ext4 filesystem with  
 the uninit_groups option enabled.

 First, I do a single test on a filesystem created without the  
 uninit_groups option and mounted with the defaults option:
   dd if=/dev/zero of=/mnt/test/foo bs=1M count=1024

 In this case, the file blocks are allocated in the groups 4, 5, 6, 7, 8,  
 9, 10.

 When the filesystem is created with the uninit_groups option enabled  
 and mounted with the defaults option, I do the same dd command.

 In this case, the file blocks are allocated in the groups 5, 7, 9, 25,  
 27, 49, 81. It seems that the blocks could be allocated only in the  
 already initialized groups.


That is because we skip the uninitialized group in ext4_mb_good_group.
I guess we are trying criteria 0 allocation and we skip uninit group for
criteria 0 . You can tune by setting higher value for
/proc/fs/ext4/partition/orders2_req. Setting it to a high value would
skip criteria 0 allocation for small requests.

I guess you are using delayed allocation ?

-aneesh

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Problems with mballoc and uninit_groups option

2008-01-11 Thread Valerie Clement

Hi,

I've got problems with mballoc when I create the ext4 filesystem with 
the uninit_groups option enabled.


First, I do a single test on a filesystem created without the 
uninit_groups option and mounted with the defaults option:

  dd if=/dev/zero of=/mnt/test/foo bs=1M count=1024

In this case, the file blocks are allocated in the groups 4, 5, 6, 7, 8, 
9, 10.


When the filesystem is created with the uninit_groups option enabled 
and mounted with the defaults option, I do the same dd command.


In this case, the file blocks are allocated in the groups 5, 7, 9, 25, 
27, 49, 81. It seems that the blocks could be allocated only in the 
already initialized groups.


Who knows how to fix that?

  Valérie
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] e2fsprogs: New bitmap and inode table allocation for FLEX_BG

2008-01-11 Thread Jose R. Santos
commit 38a4134f29b06229843bfe838c23e28f8d323b86
Author: Jose R. Santos [EMAIL PROTECTED]
Date:   Fri Jan 11 11:03:03 2008 -0600

New bitmap and inode table allocation for FLEX_BG

Change the way we allocate bitmaps and inode tables if the FLEX_BG
feature is used at mke2fs time.  It places calculates a new offset for
bitmaps and inode table base on the number of groups that the user
wishes to pack together using the new -G option.  Creating a
filesystem with 64 block groups in a flex group can be done by:

mke2fs -j -I 256 -O flex_bg -G 32 /dev/sdX

Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
Singed-off-by: Valerie Clement [EMAIL PROTECTED]

diff --git a/lib/ext2fs/alloc_tables.c b/lib/ext2fs/alloc_tables.c
index 4ad2ba9..f85ef97 100644
--- a/lib/ext2fs/alloc_tables.c
+++ b/lib/ext2fs/alloc_tables.c
@@ -27,18 +27,55 @@
 #include ext2_fs.h
 #include ext2fs.h
 
+blk_t ext2fs_flexbg_offset(ext2_filsys fs, dgrp_t group, int flexbg_size, 
+  ext2fs_block_bitmap bmap, int offset, int size)
+{
+   int flexbg;
+   errcode_t   retval;
+   blk_t   start_blk, last_blk, first_free = 0;
+   dgrp_t  last_grp;
+   
+   flexbg = group / flexbg_size;
+
+   last_grp = (group + (flexbg_size - (group % flexbg_size) - 1));
+   start_blk = ext2fs_group_first_block(fs, flexbg_size * flexbg);
+   last_blk = ext2fs_group_last_block(fs, last_grp);
+   if (last_grp  fs-group_desc_count)
+   last_grp = fs-group_desc_count;
+
+   if (ext2fs_get_free_blocks(fs, start_blk, last_blk, 1, bmap, 
+  first_free))
+   return first_free;
+   
+   if (ext2fs_get_free_blocks(fs, first_free + offset, last_blk, size, 
+  bmap, first_free))
+   return first_free;
+
+   return first_free;
+}
+
 errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
  ext2fs_block_bitmap bmap)
 {
errcode_t   retval;
blk_t   group_blk, start_blk, last_blk, new_blk, blk;
-   int j;
+   dgrp_t  last_grp;
+   int j, rem_grps, flexbg_size = 0;
 
group_blk = ext2fs_group_first_block(fs, group);
last_blk = ext2fs_group_last_block(fs, group);
 
if (!bmap)
bmap = fs-block_map;
+
+   if (EXT2_HAS_INCOMPAT_FEATURE (fs-super,
+  EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
+   flexbg_size = 1  fs-super-s_log_groups_per_flex;
+   rem_grps = flexbg_size - (group % flexbg_size);
+   last_grp = group + rem_grps - 1;
+   if (last_grp  fs-group_desc_count)
+   last_grp = fs-group_desc_count;
+   }

/*
 * Allocate the block and inode bitmaps, if necessary
@@ -56,6 +93,12 @@ errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t 
group,
} else
start_blk = group_blk;
 
+   if (flexbg_size) {
+   start_blk = ext2fs_flexbg_offset (fs, group, flexbg_size, bmap,
+ 0, rem_grps);
+   last_blk = ext2fs_group_last_block(fs, last_grp);
+   }
+
if (!fs-group_desc[group].bg_block_bitmap) {
retval = ext2fs_get_free_blocks(fs, start_blk, last_blk,
1, bmap, new_blk);
@@ -68,6 +111,12 @@ errcode_t ext2fs_allocate_group_table(ext2_filsys fs, 
dgrp_t group,
fs-group_desc[group].bg_block_bitmap = new_blk;
}
 
+   if (flexbg_size) {
+   start_blk = ext2fs_flexbg_offset (fs, group, flexbg_size, bmap,
+ flexbg_size, rem_grps);
+   last_blk = ext2fs_group_last_block(fs, last_grp);
+   }
+
if (!fs-group_desc[group].bg_inode_bitmap) {
retval = ext2fs_get_free_blocks(fs, start_blk, last_blk,
1, bmap, new_blk);
@@ -83,6 +132,13 @@ errcode_t ext2fs_allocate_group_table(ext2_filsys fs, 
dgrp_t group,
/*
 * Allocate the inode table
 */
+   if (flexbg_size) {
+   group_blk = ext2fs_flexbg_offset (fs, group, flexbg_size, bmap,
+ flexbg_size * 2,
+ fs-inode_blocks_per_group * 
rem_grps);
+   last_blk = ext2fs_group_last_block(fs, last_grp);
+   }
+
if (!fs-group_desc[group].bg_inode_table) {
retval = ext2fs_get_free_blocks(fs, group_blk, last_blk,
fs-inode_blocks_per_group,
@@ -112,6 +168,7 @@ errcode_t ext2fs_allocate_tables(ext2_filsys fs)
if (retval)
return retval;
  

[PATCH] New inode allocation for FLEX_BG meta-data groups.

2008-01-11 Thread Jose R. Santos
commit 8eef19455beb97319a78511b35b1da42a1d48eb2
Author: Jose R. Santos [EMAIL PROTECTED]
Date:   Fri Jan 11 11:04:25 2008 -0600

New inode allocation for FLEX_BG meta-data groups.

This patch mostly controls the way inode are allocated in order to
make ialloc aware of flex_bg block group grouping.  It achieves this
by bypassing the Orlov allocator when block group meta-data are packed
toghether through mke2fs.  Since the impact on the block allocator is
minimal, this patch should have little or no effect on other block
allocation algorithms. By controlling the inode allocation, it can
basically control where the initial search for new block begins and
thus indirectly manipulate the block allocator.

This allocator favors data and meta-data locality so the disk will
gradually be filled from block group zero upward.  This helps improve
performance by reducing seek time.  Since the group of inode tables
within one flex_bg are treated as one giant inode table, uninitialized
block groups would not need to partially initialize as many inode
table as with Orlov which would help fsck time as the filesystem usage
goes up.

Singed-off-by: Jose R. Santos [EMAIL PROTECTED]
Singed-off-by: Valerie Clement [EMAIL PROTECTED]

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 643046b..aed3456 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -127,6 +127,8 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, 
struct buffer_head *bh,
mark_bitmap_end(group_blocks, sb-s_blocksize * 8, bh-b_data);
}
 
+   if (sbi-s_log_groups_per_flex)
+   return free_blocks;
return free_blocks - sbi-s_itb_per_group - 2;
 }
 
@@ -759,6 +761,13 @@ do_more:
spin_unlock(sb_bgl_lock(sbi, block_group));
percpu_counter_add(sbi-s_freeblocks_counter, count);
 
+   if (sbi-s_log_groups_per_flex) {
+   ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
+   spin_lock(sb_bgl_lock(sbi, flex_group));
+   sbi-s_flex_groups[flex_group].free_blocks += count;
+   spin_unlock(sb_bgl_lock(sbi, flex_group));
+   }
+
/* We dirtied the bitmap block */
BUFFER_TRACE(bitmap_bh, dirtied bitmap block);
err = ext4_journal_dirty_metadata(handle, bitmap_bh);
@@ -1829,6 +1838,13 @@ allocated:
spin_unlock(sb_bgl_lock(sbi, group_no));
percpu_counter_sub(sbi-s_freeblocks_counter, num);
 
+   if (sbi-s_log_groups_per_flex) {
+   ext4_group_t flex_group = ext4_flex_group(sbi, group_no);
+   spin_lock(sb_bgl_lock(sbi, flex_group));
+   sbi-s_flex_groups[flex_group].free_blocks -= num;
+   spin_unlock(sb_bgl_lock(sbi, flex_group));
+   }
+
BUFFER_TRACE(gdp_bh, journal_dirty_metadata for group descriptor);
err = ext4_journal_dirty_metadata(handle, gdp_bh);
if (!fatal)
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 575b521..d4e8dea 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -158,6 +158,7 @@ void ext4_free_inode (handle_t *handle, struct inode * 
inode)
struct ext4_super_block * es;
struct ext4_sb_info *sbi;
int fatal = 0, err;
+   ext4_group_t flex_group;
 
if (atomic_read(inode-i_count)  1) {
printk (ext4_free_inode: inode has count=%d\n,
@@ -235,6 +236,12 @@ void ext4_free_inode (handle_t *handle, struct inode * 
inode)
if (is_directory)
percpu_counter_dec(sbi-s_dirs_counter);
 
+   if (sbi-s_log_groups_per_flex) {
+   flex_group = ext4_flex_group(sbi, block_group);
+   spin_lock(sb_bgl_lock(sbi, flex_group));
+   sbi-s_flex_groups[flex_group].free_inodes++;
+   spin_unlock(sb_bgl_lock(sbi, flex_group));
+   }
}
BUFFER_TRACE(bh2, call ext4_journal_dirty_metadata);
err = ext4_journal_dirty_metadata(handle, bh2);
@@ -289,6 +296,75 @@ static int find_group_dir(struct super_block *sb, struct 
inode *parent,
return ret;
 }
 
+#define free_block_ratio 10
+
+static int find_group_flex(struct super_block *sb, struct inode *parent, 
ext4_group_t *best_group)
+{
+   struct ext4_sb_info *sbi = EXT4_SB(sb);
+   struct ext4_group_desc *desc;
+   struct buffer_head *bh;
+   struct flex_groups *flex_group = sbi-s_flex_groups;
+   ext4_group_t parent_group = EXT4_I(parent)-i_block_group;
+   ext4_group_t parent_fbg_group = ext4_flex_group(sbi, parent_group);
+   ext4_group_t ngroups = sbi-s_groups_count;
+   int flex_size = ext4_flex_bg_size(sbi);
+   ext4_group_t best_flex = parent_fbg_group;
+   int blocks_per_flex = sbi-s_blocks_per_group * flex_size;
+   int flex_freeb_ratio;

[Fwd: [Bug 9732] New: oops in extent code via ext4_fallocate]

2008-01-11 Thread Eric Sandeen
---BeginMessage---
http://bugzilla.kernel.org/show_bug.cgi?id=9732

   Summary: oops in extent code via ext4_fallocate
   Product: File System
   Version: 2.5
 KernelVersion: 2.6.24-rc7
  Platform: All
OS/Version: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: ext4
AssignedTo: [EMAIL PROTECTED]
ReportedBy: [EMAIL PROTECTED]


Latest working kernel version: Unknown
Earliest failing kernel version: 2.6.24-rc7
Distribution: Fedora Rawhide
Hardware Environment: x86_64
Software Environment:
Problem Description:

a simple fallocate; truncate; fallocate series oopses the kernel.

2.6.24-rc7 with the latest ext4 patch stack as of about Jan 9, 2008, minus
delalloc patches.

Steps to reproduce, using the test program at
http://oss.sgi.com/archives/xfs/2007-07/msg00092.html

[EMAIL PROTECTED] sdb8]# touch testfile
[EMAIL PROTECTED] sdb8]# ./testfallocate -f testfile 0 65536
Trying to preallocate blocks (offset=0, len=65536)
fallocate system call succedded !  ret=0
# FALLOCATE TEST REPORT #
New blocks preallocated = 16.
Number of bytes preallocated = 65536
Old file size = 0, New file size 65536.
Old num blocks = 0, New num blocks 64.


### TESTS PASSED ###
[EMAIL PROTECTED] sdb8]# ls  -lh testfile; du -hc testfile
-rw-r--r-- 1 root root 64K Jan 11 14:12 testfile
64K testfile
64K total
[EMAIL PROTECTED] sdb8]# /root/truncate testfile 32768
Truncating testfile to 32768
[EMAIL PROTECTED] sdb8]# ls  -lh testfile; du -hc testfile
-rw-r--r-- 1 root root 32K Jan 11 14:12 testfile
32K testfile
32K total
[EMAIL PROTECTED] sdb8]# ./testfallocate -f testfile 0 65536
Trying to preallocate blocks (offset=0, len=65536)
Segmentation fault


and yields:
EXT4-fs: file extents enabled
EXT4-fs: mballoc enabled
[ cut here ]
kernel BUG at fs/ext4/extents.c:1056!
invalid opcode:  [1] SMP 
CPU 2 
Modules linked in: ext4dev jbd2 crc16 autofs4 hidp nfs lockd nfs_acl rfcomm
l2cap bluetooth sunrpc ipv6 cpufreq_ondemand dm_multipath video output sbs
sbshc battery ac power_supply parport_pc lp parport sg pata_acpi
cfi_cmdset_0002 ata_generic cfi_util button jedec_probe serio_raw cfi_probe tg3
gen_probe ck804xrom mtd rtc_cmos chipreg pata_amd map_funcs k8temp libata hwmon
i2c_nforce2 shpchp i2c_core pcspkr dm_snapshot dm_zero dm_mirror dm_mod qla2xxx
scsi_transport_fc scsi_tgt mptspi mptscsih mptbase scsi_transport_spi sd_mod
scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd uhci_hcd
Pid: 3554, comm: testfallocate Not tainted 2.6.24-0.147.rc7.git2.fc9 #1
RIP: 0010:[88364497]  [88364497]
:ext4dev:ext4_ext_search_left+0x97/0xbc
RSP: 0018:81012dc65cf0  EFLAGS: 00010287
RAX: 8008 RBX: 81012dc65d78 RCX: 
RDX: 81012dc65d90 RSI: 81012e5f37e0 RDI: 81012e04c00c
RBP: 81012e04c180 R08: 0008 R09: 
R10:  R11: 81012dc65d98 R12: 0008
R13: 81012e04c180 R14: 0008 R15: 
FS:  2aab96f0() GS:81013fc01a28() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 00327d0414a0 CR3: 000130ce8000 CR4: 06a0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process testfallocate (pid: 3554, threadinfo 81012dc64000, task
81012dd5a000)
Stack:  88366073 8101349e71d0 81053c91 00022dcc30b8
 81012dc65e88 0008 81012e549000 81012dcc30b8
 81012dcc3090 81012e04c000 81012dcc30b8 0008
Call Trace:
 [88366073] :ext4dev:ext4_ext_get_blocks+0x5ba/0x8c1
 [81053c91] lock_release_holdtime+0x27/0x49
 [812748f6] _spin_unlock+0x17/0x20
 [883400a6] :jbd2:start_this_handle+0x4e0/0x4fe
 [88366564] :ext4dev:ext4_fallocate+0x175/0x39a
 [81053c91] lock_release_holdtime+0x27/0x49
 [81056480] __lock_acquire+0x4e7/0xc4d
 [81053c91] lock_release_holdtime+0x27/0x49
 [810a8de7] sys_fallocate+0xe4/0x10d
 [8100c043] tracesys+0xd5/0xda


Code: 0f 0b eb fe ff c8 89 02 0f b7 47 06 8b 4f 08 0f b7 57 04 48 
RIP  [88364497] :ext4dev:ext4_ext_search_left+0x97/0xbc
 RSP 81012dc65cf0
---[ end trace 9a60a6a6c694770a ]---
SysRq : Resetting


The BUG_ON is:

BUG_ON(*logical  le32_to_cpu(ex-ee_block) + le16_to_cpu(ex-ee_len));

where these were the values:

logical 8 ee_block 0 ee_len 32776

Haven't looked further into it yet.


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You reported the bug, or are watching the reporter.
---End Message---


Re: [PATCH] e2fsprogs: New bitmap and inode table allocation for FLEX_BG

2008-01-11 Thread Andreas Dilger
On Jan 11, 2008  11:28 -0600, Jose R. Santos wrote:
 +blk_t ext2fs_flexbg_offset(ext2_filsys fs, dgrp_t group, int flexbg_size, 
 +ext2fs_block_bitmap bmap, int offset, int size)

Could you please add some comments for what this function is trying to do?

 + last_grp = (group + (flexbg_size - (group % flexbg_size) - 1));

Is this the same as:

last_grp = group + (flexbg_size - 1) / flexbg_size * flexbg_size

(i.e. trying to round up to the next even multiple of flexbg_size)?

Didn't we decide to have flexbg_size be a power-of-two value, so we could
use shift and mask instead of divide and mod?  It's less of an issue because
group is only a 32-bit value, I guess.

 + if (ext2fs_get_free_blocks(fs, start_blk, last_blk, 1, bmap, 
 +first_free))
 + return first_free;
 + 
 + if (ext2fs_get_free_blocks(fs, first_free + offset, last_blk, size, 
 +bmap, first_free))
 + return first_free;
 +
 + return first_free;
 +}
 +
  errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
 ext2fs_block_bitmap bmap)
  {
   errcode_t   retval;
   blk_t   group_blk, start_blk, last_blk, new_blk, blk;
 - int j;
 + dgrp_t  last_grp;
 + int j, rem_grps, flexbg_size = 0;
  
   group_blk = ext2fs_group_first_block(fs, group);
   last_blk = ext2fs_group_last_block(fs, group);
  
   if (!bmap)
   bmap = fs-block_map;
 +
 + if (EXT2_HAS_INCOMPAT_FEATURE (fs-super,
 +EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
 + flexbg_size = 1  fs-super-s_log_groups_per_flex;
 + rem_grps = flexbg_size - (group % flexbg_size);

Hmm, no point in doing groups % flexbg_size if we have
s_log_groups_per_flex.  Could do groups  (flexbg_size - 1) instead.

 @@ -101,7 +102,11 @@ int ext2fs_super_and_bgd_loc(ext2_filsys fs,
 + if (flex_bg_size) {
 + if ((group % flex_bg_size) == 0)
 + numblocks -= 2 + fs-inode_blocks_per_group;

Ditto.

 @@ -1045,6 +1046,20 @@ static void PRS(int argc, char *argv[])
   exit(1);
   }
   break;
 + case 'G':
 + flex_bg_size = strtoul(optarg, tmp, 0);
 + if (*tmp) {
 + com_err(program_name, 0,
 + _(Illegal number for Flex_BG size));
 + exit(1);
 + }
 + if (flex_bg_size  2 || 
 + (flex_bg_size  (flex_bg_size-1)) != 0) {
 + com_err(program_name, 0,
 + _(Flex_BG size must be a power of 2));
 + exit(1);
 + }
 + break;

We've been putting new options under -E var=value...  I don't know what
Ted's thoughs are on using new option letters, though this one might qualify.

 @@ -1444,6 +1459,16 @@ static void PRS(int argc, char *argv[])
   }
   }
  
 + if(flex_bg_size) {

Space after if .

 + shift = 0;
 + tmp = flex_bg_size;
 + while ((tmp = 1UL) != 0UL)
 + shift++;

There isn't a log2 function?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] e2fsprogs: New bitmap and inode table allocation for FLEX_BG

2008-01-11 Thread Jose R. Santos
On Fri, 11 Jan 2008 14:01:04 -0700
Andreas Dilger [EMAIL PROTECTED] wrote:

 On Jan 11, 2008  11:28 -0600, Jose R. Santos wrote:
  +blk_t ext2fs_flexbg_offset(ext2_filsys fs, dgrp_t group, int flexbg_size, 
  +  ext2fs_block_bitmap bmap, int offset, int size)

OK.
 
 Could you please add some comments for what this function is trying to do?
 
  +   last_grp = (group + (flexbg_size - (group % flexbg_size) - 1));
 
 Is this the same as:
 
   last_grp = group + (flexbg_size - 1) / flexbg_size * flexbg_size
   
 (i.e. trying to round up to the next even multiple of flexbg_size)?
 
 Didn't we decide to have flexbg_size be a power-of-two value, so we could
 use shift and mask instead of divide and mod?  It's less of an issue because
 group is only a 32-bit value, I guess.

Yes, I fixes this in the kernel code but neglected to fix it on the here.
 
  +   if (ext2fs_get_free_blocks(fs, start_blk, last_blk, 1, bmap, 
  +  first_free))
  +   return first_free;
  +   
  +   if (ext2fs_get_free_blocks(fs, first_free + offset, last_blk, size, 
  +  bmap, first_free))
  +   return first_free;
  +
  +   return first_free;
  +}
  +
   errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
ext2fs_block_bitmap bmap)
   {
  errcode_t   retval;
  blk_t   group_blk, start_blk, last_blk, new_blk, blk;
  -   int j;
  +   dgrp_t  last_grp;
  +   int j, rem_grps, flexbg_size = 0;
   
  group_blk = ext2fs_group_first_block(fs, group);
  last_blk = ext2fs_group_last_block(fs, group);
   
  if (!bmap)
  bmap = fs-block_map;
  +
  +   if (EXT2_HAS_INCOMPAT_FEATURE (fs-super,
  +  EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
  +   flexbg_size = 1  fs-super-s_log_groups_per_flex;
  +   rem_grps = flexbg_size - (group % flexbg_size);
 
 Hmm, no point in doing groups % flexbg_size if we have
 s_log_groups_per_flex.  Could do groups  (flexbg_size - 1) instead.
 
  @@ -101,7 +102,11 @@ int ext2fs_super_and_bgd_loc(ext2_filsys fs,
  +   if (flex_bg_size) {
  +   if ((group % flex_bg_size) == 0)
  +   numblocks -= 2 + fs-inode_blocks_per_group;
 
 Ditto.
 
  @@ -1045,6 +1046,20 @@ static void PRS(int argc, char *argv[])
  exit(1);
  }
  break;
  +   case 'G':
  +   flex_bg_size = strtoul(optarg, tmp, 0);
  +   if (*tmp) {
  +   com_err(program_name, 0,
  +   _(Illegal number for Flex_BG size));
  +   exit(1);
  +   }
  +   if (flex_bg_size  2 || 
  +   (flex_bg_size  (flex_bg_size-1)) != 0) {
  +   com_err(program_name, 0,
  +   _(Flex_BG size must be a power of 2));
  +   exit(1);
  +   }
  +   break;
 
 We've been putting new options under -E var=value...  I don't know what
 Ted's thoughs are on using new option letters, though this one might qualify.

I thought this would qualify as a new option letter.  Waiting on input
from Ted.

  @@ -1444,6 +1459,16 @@ static void PRS(int argc, char *argv[])
  }
  }
   
  +   if(flex_bg_size) {
 
 Space after if .

Will fix.
 
  +   shift = 0;
  +   tmp = flex_bg_size;
  +   while ((tmp = 1UL) != 0UL)
  +   shift++;
 
 There isn't a log2 function?

Couldn't find anything in lib or include.
 
 Cheers, Andreas
 --
 Andreas Dilger
 Sr. Staff Engineer, Lustre Group
 Sun Microsystems of Canada, Inc.
 



-JRS
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] New inode allocation for FLEX_BG meta-data groups.

2008-01-11 Thread Andreas Dilger
On Jan 11, 2008  11:28 -0600, Jose R. Santos wrote:
 @@ -127,6 +127,8 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, 
 struct buffer_head *bh,
   mark_bitmap_end(group_blocks, sb-s_blocksize * 8, bh-b_data);
   }
  
 + if (sbi-s_log_groups_per_flex)
 + return free_blocks;
   return free_blocks - sbi-s_itb_per_group - 2;

To be honest, I think this is a wart in ext4_init_block_bitmap() that
it returns the number of free blocks in the group.  That value should
really be set at mke2fs or e2fsck time, and if the last group is marked
BLOCK_UNINIT it gets the free blocks count wrong because it always starts
with EXT4_BLOCKS_PER_GROUP().

The above patch may also be incorrect since there may be inode tables or
bitmaps in the above group even in the case of FLEX_BG filesystems.

 +#define free_block_ratio 10
 +
 +static int find_group_flex(struct super_block *sb, struct inode *parent, 
 ext4_group_t *best_group)
 +{
 + n_fbg_groups = (sbi-s_groups_count + flex_size - 1) / flex_size;

Can be a shift?

I would suggest doing some kind of testing to see how well this allocation
policy is working.  We don't want to force all allocations contiguously at
the start of the filesystem, or we end up with FAT...

 +static int ext4_fill_flex_info(struct super_block *sb)
 +{
 + sbi-s_log_groups_per_flex = sbi-s_es-s_log_groups_per_flex;

Hmm, I guess no le*_to_cpu() because this is 8 bits?

 +
 + flex_group_count = (sbi-s_groups_count + groups_per_flex - 1) /
 + groups_per_flex;
 + sbi-s_flex_groups = kmalloc(flex_group_count *
 +  sizeof(struct flex_groups), GFP_KERNEL);
 + if (sbi-s_flex_groups == NULL) {
 + printk(KERN_ERR EXT4-fs: not enough memory\n);

This should report not enough memory for N flex groups or something.

 @@ -2105,6 +2154,13 @@ static int ext4_fill_super (struct super_block *sb,
 + if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG))
 + if (!ext4_fill_flex_info(sb)) {
 + printk(KERN_ERR
 +EXT4-fs: unable to initialize flex_bg meta 
 info!\n);
 + goto failed_mount2;

Should this be considered a fatal error, or could sbi-s_log_groups_per_flex
just be set to 0 and the filesystem be used as-is (maybe with sub-optimal
allocations or something)?  Otherwise this renders the filesystem unusable.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] New inode allocation for FLEX_BG meta-data groups.

2008-01-11 Thread Jose R. Santos
On Fri, 11 Jan 2008 14:46:58 -0700
Andreas Dilger [EMAIL PROTECTED] wrote:

 On Jan 11, 2008  11:28 -0600, Jose R. Santos wrote:
  @@ -127,6 +127,8 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, 
  struct buffer_head *bh,
  mark_bitmap_end(group_blocks, sb-s_blocksize * 8, bh-b_data);
  }
   
  +   if (sbi-s_log_groups_per_flex)
  +   return free_blocks;
  return free_blocks - sbi-s_itb_per_group - 2;
 
 To be honest, I think this is a wart in ext4_init_block_bitmap() that
 it returns the number of free blocks in the group.  That value should
 really be set at mke2fs or e2fsck time, and if the last group is marked
 BLOCK_UNINIT it gets the free blocks count wrong because it always starts
 with EXT4_BLOCKS_PER_GROUP().
 
 The above patch may also be incorrect since there may be inode tables or
 bitmaps in the above group even in the case of FLEX_BG filesystems.
 
  +#define free_block_ratio 10
  +
  +static int find_group_flex(struct super_block *sb, struct inode *parent, 
  ext4_group_t *best_group)
  +{
  +   n_fbg_groups = (sbi-s_groups_count + flex_size - 1) / flex_size;
 
 Can be a shift?

You're right.  This should be:

n_fbg_groups = (sbi-s_groups_count + flex_size - 1)  
sbi-s_log_groups_per_flex;

 I would suggest doing some kind of testing to see how well this allocation
 policy is working.  We don't want to force all allocations contiguously at
 the start of the filesystem, or we end up with FAT...

I've done several IO patterns with multiple threads and all of my test
are either same or faster performance than with the regular allocator.
Im hopping that moving this to the patch queue will expose the
allocator to more tests.
 
  +static int ext4_fill_flex_info(struct super_block *sb)
  +{
  +   sbi-s_log_groups_per_flex = sbi-s_es-s_log_groups_per_flex;
 
 Hmm, I guess no le*_to_cpu() because this is 8 bits?

Correct.

  +
  +   flex_group_count = (sbi-s_groups_count + groups_per_flex - 1) /
  +   groups_per_flex;
  +   sbi-s_flex_groups = kmalloc(flex_group_count *
  +sizeof(struct flex_groups), GFP_KERNEL);
  +   if (sbi-s_flex_groups == NULL) {
  +   printk(KERN_ERR EXT4-fs: not enough memory\n);
 
 This should report not enough memory for N flex groups or something.

OK.
 
  @@ -2105,6 +2154,13 @@ static int ext4_fill_super (struct super_block *sb,
  +   if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG))
  +   if (!ext4_fill_flex_info(sb)) {
  +   printk(KERN_ERR
  +  EXT4-fs: unable to initialize flex_bg meta 
  info!\n);
  +   goto failed_mount2;
 
 Should this be considered a fatal error, or could sbi-s_log_groups_per_flex
 just be set to 0 and the filesystem be used as-is (maybe with sub-optimal
 allocations or something)?  Otherwise this renders the filesystem unusable.

I thought about doing that but using a sub-optimal allocator would
permanently screw up the ondisk data locality.  Maybe mounting the
filesystem read-only would be more appropriate.

 Cheers, Andreas
 --
 Andreas Dilger
 Sr. Staff Engineer, Lustre Group
 Sun Microsystems of Canada, Inc.
 



-JRS
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html