Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-28 Thread Theodore Ts'o
On Sun, Oct 28, 2012 at 09:35:58PM -0500, Eric Sandeen wrote:
> Yeah, seems that way.
> 
> Then your simpler version is probably the way to go.

If you have a chance, could you do me a favor and test my -v3 version
of the patch?  It should work just as well as yours, but I'm getting
paranoid in my old age, and you seem to have a reliable way of testing
for this failure.  I still need to figure out why my kvm-based
approach isn't showing the problem.

Thanks,

-- Ted


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-28 Thread Eric Sandeen
On 10/28/12 9:34 PM, Theodore Ts'o wrote:
> On Sun, Oct 28, 2012 at 09:24:19PM -0500, Eric Sandeen wrote:
>> Yeah, I knew it wasn't ;)  I did resend 
>> [PATCH] ext4: fix unjournaled inode bitmap modification
>> which is a bit more involved.
> 
> Yeah, sorry, I didn't see your updated patch at first, since this mail
> thread is one complicated tangle.  :-(
> 
>> That'll get_write_access on the same buffer over and over, I suppose
>> it's ok, but the patch I sent tries to minimize that, and call
>> ext4_handle_release_buffer if we're not going to use it (which is
>> a no-op today anyway and not normally used I guess...)
> 
> Well, it's really rare that we will go through that loop more than
> once; it only happens if we have multiple processes race against each
> other trying to grab the same inode.
> 
>> If ext4_handle_release_buffer() is dead code now, and repeated calls
>> via repeat_in_this_group: are no big deal, then your version looks fine.
> 
> Yeah, I think it's pretty much dead code.  At least, I can't think of
> a good reason why we would want to actually try to handle
> ext4_handle_release_buffer() to claw back the transaction credit.  And
> if we do, we'll have to do a sweep through the entire ext4 codebase
> anyway.

Yeah, seems that way.

Then your simpler version is probably the way to go.

Thanks,
-Eric

>   - Ted



Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-28 Thread Theodore Ts'o
On Sun, Oct 28, 2012 at 09:24:19PM -0500, Eric Sandeen wrote:
> Yeah, I knew it wasn't ;)  I did resend 
> [PATCH] ext4: fix unjournaled inode bitmap modification
> which is a bit more involved.

Yeah, sorry, I didn't see your updated patch at first, since this mail
thread is one complicated tangle.  :-(

> That'll get_write_access on the same buffer over and over, I suppose
> it's ok, but the patch I sent tries to minimize that, and call
> ext4_handle_release_buffer if we're not going to use it (which is
> a no-op today anyway and not normally used I guess...)

Well, it's really rare that we will go through that loop more than
once; it only happens if we have multiple processes race against each
other trying to grab the same inode.
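
(For illustration only: a toy, userspace-only sketch of the retry flow
being discussed.  None of the helpers below are the real ext4/jbd2
functions; the point is simply that a second trip through
repeat_in_this_group, and hence a repeated write-access request on the
same buffer, only happens when the test-and-set on the bitmap loses a
race.)

#include <stdio.h>

static int write_access_requests;       /* stand-in for get_write_access */

static int fake_get_write_access(void)
{
        write_access_requests++;        /* count how often it is requested */
        return 0;
}

static int fake_test_and_set_bit(int bit, unsigned long *bitmap)
{
        int old = (*bitmap >> bit) & 1;

        *bitmap |= 1UL << bit;
        return old;                     /* non-zero means we lost the race */
}

int main(void)
{
        unsigned long bitmap = 0x1;     /* bit 0 already taken by a "racing" task */
        int bit = 0;

repeat_in_this_group:
        fake_get_write_access();
        if (fake_test_and_set_bit(bit, &bitmap)) {
                bit++;                  /* lost the race, try the next bit */
                goto repeat_in_this_group;
        }
        printf("grabbed bit %d after %d write-access request(s)\n",
               bit, write_access_requests);
        return 0;
}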

> If ext4_handle_release_buffer() is dead code now, and repeated calls
> via repeat_in_this_group: are no big deal, then your version looks fine.

Yeah, I think it's pretty much dead code.  At least, I can't think of
a good reason why we would want to actually try to handle
ext4_handle_release_buffer() to claw back the transaction credit.  And
if we do, we'll have to do a sweep through the entire ext4 codebase
anyway.

- Ted


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-28 Thread Eric Sandeen
On 10/28/12 8:00 PM, Theodore Ts'o wrote:
> On Sat, Oct 27, 2012 at 05:42:07PM -0500, Eric Sandeen wrote:
>>
>> It looks like the inode_bitmap_bh is being modified outside a transaction:
>>
>> ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
>>
>> It needs a get_write_access / handle_dirty_metadata pair around it.
> 
> Oops.   Nice catch!!
> 
> The patch isn't quite right, though.  

Yeah, I knew it wasn't ;)  I did resend 
[PATCH] ext4: fix unjournaled inode bitmap modification
which is a bit more involved.

> We only want to call
> ext4_journal_get_write_access() when we know that there is an available
> bit in the bitmap.  (We could still lose the race, but in that case
> the winner of the race probably grabbed the bitmap block first.)
> 
> Also, we only need to call ext4_handle_dirty_metadata() if we
> successfully grab the bit in the bitmap.
> 
> So I suggest this patch instead:

That'll get_write_access on the same buffer over and over, I suppose
it's ok, but the patch I sent tries to minimize that, and call
ext4_handle_release_buffer if we're not going to use it (which is
a no-op today anyway and not normally used I guess...)

If ext4_handle_release_buffer() is dead code now, and repeated calls
via repeat_in_this_group: are no big deal, then your version looks fine.

-Eric

> commit 087eda81f1ac6a6a0394f781b44f1d555d8f64c6
> Author: Eric Sandeen 
> Date:   Sun Oct 28 20:59:57 2012 -0400
> 
> ext4: fix unjournaled inode bitmap modification
> 
> commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 modified this function
> such that the inode bitmap was being modified outside a transaction,
> which could lead to corruption, and was discovered when journal_checksum
> found a bad checksum in the journal.
> 
> Reported-by: Nix 
> Signed-off-by: Eric Sandeen 
> Signed-off-by: "Theodore Ts'o" 
> Cc: sta...@vger.kernel.org
> 
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 4facdd2..575afac 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -725,6 +725,10 @@ repeat_in_this_group:
>                             "inode=%lu", ino + 1);
>              continue;
>          }
> +        BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
> +        err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
> +        if (err)
> +            goto fail;
>          ext4_lock_group(sb, group);
>          ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
>          ext4_unlock_group(sb, group);
> @@ -738,6 +742,11 @@ repeat_in_this_group:
>      goto out;
>  
>  got:
> +    BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
> +    err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
> +    if (err)
> +        goto fail;
> +
>      /* We may have to initialize the block bitmap if it isn't already */
>      if (ext4_has_group_desc_csum(sb) &&
>          gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {



Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-28 Thread Nix
On 29 Oct 2012, Theodore Ts'o spake thusly:

> commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 modified this function
> such that the inode bitmap was being modified outside a transaction,
> which could lead to corruption, and was discovered when journal_checksum
> found a bad checksum in the journal.

Hm. If this could have caused corruption for non-checksum users, it must
be a pretty rare case if nobody's hit it in six months -- or maybe, I
suppose, they hit it and never noticed. (But, hey, this makes me happier
to have reported this despite all the flap, if it's found a genuine bug
that could have hit people not using weirdo mount options.)

Thanks for spending so much time on this fix. Much appreciated.

-- 
NULL && (void)


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-28 Thread Theodore Ts'o
On Sat, Oct 27, 2012 at 05:42:07PM -0500, Eric Sandeen wrote:
> 
> It looks like the inode_bitmap_bh is being modified outside a transaction:
> 
> ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
> 
> It needs a get_write_access / handle_dirty_metadata pair around it.

Oops.   Nice catch!!

The patch isn't quite right, though.  We only want to call
ext4_journal_get_write_access() when we know that there is an available
bit in the bitmap.  (We could still lose the race, but in that case
the winner of the race probably grabbed the bitmap block first.)

Also, we only need to call ext4_handle_dirty_metadata() if we
successfully grab the bit in the bitmap.

So I suggest this patch instead:

commit 087eda81f1ac6a6a0394f781b44f1d555d8f64c6
Author: Eric Sandeen 
Date:   Sun Oct 28 20:59:57 2012 -0400

ext4: fix unjournaled inode bitmap modification

commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 modified this function
such that the inode bitmap was being modified outside a transaction,
which could lead to corruption, and was discovered when journal_checksum
found a bad checksum in the journal.

Reported-by: Nix 
Signed-off-by: Eric Sandeen 
Signed-off-by: "Theodore Ts'o" 
Cc: sta...@vger.kernel.org

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 4facdd2..575afac 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -725,6 +725,10 @@ repeat_in_this_group:
                            "inode=%lu", ino + 1);
             continue;
         }
+        BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
+        err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
+        if (err)
+            goto fail;
         ext4_lock_group(sb, group);
         ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
         ext4_unlock_group(sb, group);
@@ -738,6 +742,11 @@ repeat_in_this_group:
     goto out;
 
 got:
+    BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
+    err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
+    if (err)
+        goto fail;
+
     /* We may have to initialize the block bitmap if it isn't already */
     if (ext4_has_group_desc_csum(sb) &&
         gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-27 Thread Eric Sandeen
On 10/27/12 4:19 PM, Eric Sandeen wrote:
> On 10/27/12 1:47 PM, Nix wrote:
>> On 27 Oct 2012, Theodore Ts'o said:
>>
>>> On Sat, Oct 27, 2012 at 01:45:25PM +0100, Nix wrote:
 Ah! it's turned on by journal_async_commit. OK, that alone argues
 against use of journal_async_commit, tested or not, and I'd not have
 turned it on if I'd noticed that.

 (So, the combinations I'll be trying for effect on this bug are:

  journal_async_commit (as now)
  journal_checksum
  none
>>>
>>> Can you also check and see whether the presence or absence of
>>> "nobarrier" makes a difference?
>>
>> Done. (Also checked the effect of your patches posted earlier this week:
>> no effect, I'm afraid, certainly not under the fails-even-on-3.6.1 test
>> I was carrying out, umount -l'ing /var as the very last thing I did
>> before /sbin/reboot -f.)
>>
>> nobarrier makes a difference that I, at least, did not expect:
>>
>> [no options]                      No corruption
>>
>> nobarrier                         No corruption
>>
>> journal_checksum                  Corruption
>>                                   Corrupted transaction, journal aborted
>>
>> nobarrier,journal_checksum        Corruption
>>                                   Corrupted transaction, journal aborted
>>
>> journal_async_commit              Corruption
>>                                   Corrupted transaction, journal aborted
>>
>> nobarrier,journal_async_commit    Corruption
>>                                   No corrupted transaction or aborted journal
> 
> That's what we needed.  Woulda been great a few days ago ;)
> 
> In my testing journal_checksum is broken, and my bisection seems to
> implicate
> 
> commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31
> Author: Theodore Ts'o 
> Date:   Mon Feb 6 20:12:03 2012 -0500
> 
> ext4: fold ext4_claim_inode into ext4_new_inode
> 
> as the culprit.  I haven't had time to look into why, yet.

It looks like the inode_bitmap_bh is being modified outside a transaction:

ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);

It needs a get_write_access / handle_dirty_metadata pair around it.

A hacky patch like this seems to work, but it was done 5 mins before I have
to be out the door to dinner, so it probably needs cleanup or at least
scrutiny.

[PATCH] ext4: fix unjournaled inode bitmap modification

commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 modified this function
such that the inode bitmap was being modified outside a transaction,
which could lead to corruption, and was discovered when journal_checksum
found a bad checksum in the journal.

Signed-off-by: Eric Sandeen 
---

--- ialloc.c.reverted2  2012-10-27 17:31:20.351537073 -0500
+++ ialloc.c            2012-10-27 17:40:18.643553576 -0500
@@ -669,6 +669,10 @@
         inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
         if (!inode_bitmap_bh)
             goto fail;
+        BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
+        err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
+        if (err)
+            goto fail;
 
 repeat_in_this_group:
         ino = ext4_find_next_zero_bit((unsigned long *)
@@ -690,6 +694,10 @@
         ino++;  /* the inode bitmap is zero-based */
         if (!ret2)
             goto got; /* we grabbed the inode! */
+        BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
+        err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
+        if (err)
+            goto fail;
         if (ino < EXT4_INODES_PER_GROUP(sb))
             goto repeat_in_this_group;
     }




> -Eric
> 
>> I didn't expect the last case at all, and it adequately explains why you
>> are mostly seeing corrupted journal messages in your tests but I was
>> not. It also explains why when I saw this for the first time I was able
>> to mount the resulting corrupted filesystem read-write and corrupt it
>> further before I noticed that anything was wrong.
>>
>> It is also clear that journal_checksum and all that relies on it is
>> worse than useless right now, as Eric reported while I was testing this.
>> It should probably be marked CONFIG_BROKEN in future 3.[346].* stable
>> kernels, if CONFIG_BROKEN existed anymore, which it doesn't.
>>
>> It's a shame journal_async_commit depends on a broken feature: it might
>> be notionally unsafe but on some of my systems (without nobarrier or
>> flashy caching controllers) it was associated with a noticeable speedup
>> of metadata-heavy workloads -- though that was way back in 2009...
>> however, "safety first" definitely applies in this case.
>>
> 


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-27 Thread Eric Sandeen
On 10/27/12 1:47 PM, Nix wrote:
> On 27 Oct 2012, Theodore Ts'o said:
> 
>> On Sat, Oct 27, 2012 at 01:45:25PM +0100, Nix wrote:
>>> Ah! it's turned on by journal_async_commit. OK, that alone argues
>>> against use of journal_async_commit, tested or not, and I'd not have
>>> turned it on if I'd noticed that.
>>>
>>> (So, the combinations I'll be trying for effect on this bug are:
>>>
>>>  journal_async_commit (as now)
>>>  journal_checksum
>>>  none
>>
>> Can you also check and see whether the presence or absence of
>> "nobarrier" makes a difference?
> 
> Done. (Also checked the effect of your patches posted earlier this week:
> no effect, I'm afraid, certainly not under the fails-even-on-3.6.1 test
> I was carrying out, umount -l'ing /var as the very last thing I did
> before /sbin/reboot -f.)
> 
> nobarrier makes a difference that I, at least, did not expect:
> 
> [no options]                      No corruption
> 
> nobarrier                         No corruption
> 
> journal_checksum                  Corruption
>                                   Corrupted transaction, journal aborted
> 
> nobarrier,journal_checksum        Corruption
>                                   Corrupted transaction, journal aborted
> 
> journal_async_commit              Corruption
>                                   Corrupted transaction, journal aborted
> 
> nobarrier,journal_async_commit    Corruption
>                                   No corrupted transaction or aborted journal

That's what we needed.  Woulda been great a few days ago ;)

In my testing journal_checksum is broken, and my bisection seems to
implicate

commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31
Author: Theodore Ts'o 
Date:   Mon Feb 6 20:12:03 2012 -0500

ext4: fold ext4_claim_inode into ext4_new_inode

as the culprit.  I haven't had time to look into why, yet.

-Eric

> I didn't expect the last case at all, and it adequately explains why you
> are mostly seeing corrupted journal messages in your tests but I was
> not. It also explains why when I saw this for the first time I was able
> to mount the resulting corrupted filesystem read-write and corrupt it
> further before I noticed that anything was wrong.
> 
> It is also clear that journal_checksum and all that relies on it is
> worse than useless right now, as Eric reported while I was testing this.
> It should probably be marked CONFIG_BROKEN in future 3.[346].* stable
> kernels, if CONFIG_BROKEN existed anymore, which it doesn't.
> 
> It's a shame journal_async_commit depends on a broken feature: it might
> be notionally unsafe but on some of my systems (without nobarrier or
> flashy caching controllers) it was associated with a noticeable speedup
> of metadata-heavy workloads -- though that was way back in 2009...
> however, "safety first" definitely applies in this case.
> 



Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-27 Thread Nix
On 27 Oct 2012, Theodore Ts'o said:

> On Sat, Oct 27, 2012 at 01:45:25PM +0100, Nix wrote:
>> Ah! it's turned on by journal_async_commit. OK, that alone argues
>> against use of journal_async_commit, tested or not, and I'd not have
>> turned it on if I'd noticed that.
>> 
>> (So, the combinations I'll be trying for effect on this bug are:
>> 
>>  journal_async_commit (as now)
>>  journal_checksum
>>  none
>
> Can you also check and see whether the presence or absence of
> "nobarrier" makes a difference?

Done. (Also checked the effect of your patches posted earlier this week:
no effect, I'm afraid, certainly not under the fails-even-on-3.6.1 test
I was carrying out, umount -l'ing /var as the very last thing I did
before /sbin/reboot -f.)

nobarrier makes a difference that I, at least, did not expect:

[no options]                      No corruption

nobarrier                         No corruption

journal_checksum                  Corruption
                                  Corrupted transaction, journal aborted

nobarrier,journal_checksum        Corruption
                                  Corrupted transaction, journal aborted

journal_async_commit              Corruption
                                  Corrupted transaction, journal aborted

nobarrier,journal_async_commit    Corruption
                                  No corrupted transaction or aborted journal

I didn't expect the last case at all, and it adequately explains why you
are mostly seeing corrupted journal messages in your tests but I was
not. It also explains why when I saw this for the first time I was able
to mount the resulting corrupted filesystem read-write and corrupt it
further before I noticed that anything was wrong.

It is also clear that journal_checksum and all that relies on it is
worse than useless right now, as Eric reported while I was testing this.
It should probably be marked CONFIG_BROKEN in future 3.[346].* stable
kernels, if CONFIG_BROKEN existed anymore, which it doesn't.

It's a shame journal_async_commit depends on a broken feature: it might
be notionally unsafe but on some of my systems (without nobarrier or
flashy caching controllers) it was associated with a noticeable speedup
of metadata-heavy workloads -- though that was way back in 2009...
however, "safety first" definitely applies in this case.

-- 
NULL && (void)


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-27 Thread Eric Sandeen
On 10/27/12 7:45 AM, Nix wrote:
> [nfs people purged from Cc]
> 
> On 27 Oct 2012, Theodore Ts'o verbalised:
> 
>> Huh?  It's not turned on by default.  If you mount with no mount
>> options, journal checksums are *not* turned on.
> 
> ?! it's turned on for me, and though I use weird mount options I don't
> use that one:

journal_async_commit implies journal_checksum:

    {Opt_journal_async_commit, (EXT4_MOUNT_JOURNAL_ASYNC_COMMIT |
                                EXT4_MOUNT_JOURNAL_CHECKSUM), MOPT_SET},
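
(For illustration, a small compilable sketch of what MOPT_SET means here;
the flag values below are stand-ins rather than lines copied from ext4.h.
Selecting journal_async_commit ORs the token's whole mask into the mount
options, so journal_checksum comes along with it.)

#include <stdio.h>

/* illustrative values only, not the real ext4.h definitions */
#define EXT4_MOUNT_JOURNAL_CHECKSUM      0x00800000
#define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT  0x01000000

int main(void)
{
        /* MOPT_SET behaviour: the token's full mask is ORed into the options */
        unsigned long async_commit_mask = EXT4_MOUNT_JOURNAL_ASYNC_COMMIT |
                                          EXT4_MOUNT_JOURNAL_CHECKSUM;
        unsigned long mount_opt = 0;

        mount_opt |= async_commit_mask; /* user asked for journal_async_commit */

        if (mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM)
                printf("journal_checksum enabled as a side effect\n");
        return 0;
}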

journal_checksum seems to have broken, at least for me, between 3.3 & 3.4.
I think I've narrowed down the commit but am not sure yet what the flaw is;
I will investigate & report back later.  Busy Saturday.

So anyway, turning on journal_async_commit (notionally unsafe) enables 
journal_checksum (apparently broken).

-Eric

> /dev/main/var   /var    ext4
> defaults,nobarrier,usrquota,grpquota,nosuid,nodev,relatime,journal_async_commit,commit=30,user_xattr,acl
>  1  2
> Default mount options:    (none)
> /dev/mapper/main-var /var ext4 
> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota
>  0 0
> 
> ...
> 
> Ah! it's turned on by journal_async_commit. OK, that alone argues
> against use of journal_async_commit, tested or not, and I'd not have
> turned it on if I'd noticed that.
> 
> (So, the combinations I'll be trying for effect on this bug are:
> 
>  journal_async_commit (as now)
>  journal_checksum
>  none
> 
> Technically to investigate all possibilities we should try
> journal_async_commit,no_journal_checksum, but this seems so unlikely to
> have ever been tested by anyone that it's not worth looking into...)
> 



Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-27 Thread Theodore Ts'o
On Sat, Oct 27, 2012 at 01:45:25PM +0100, Nix wrote:
> Ah! it's turned on by journal_async_commit. OK, that alone argues
> against use of journal_async_commit, tested or not, and I'd not have
> turned it on if I'd noticed that.
> 
> (So, the combinations I'll be trying for effect on this bug are:
> 
>  journal_async_commit (as now)
>  journal_checksum
>  none

Can you also check and see whether the presence or absence of
"nobarrier" makes a difference?

Thanks,

- Ted


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-27 Thread Nix
[nfs people purged from Cc]

On 27 Oct 2012, Theodore Ts'o verbalised:

> Huh?  It's not turned on by default.  If you mount with no mount
> options, journal checksums are *not* turned on.

?! it's turned on for me, and though I use weird mount options I don't
use that one:

/dev/main/var   /var    ext4
defaults,nobarrier,usrquota,grpquota,nosuid,nodev,relatime,journal_async_commit,commit=30,user_xattr,acl
 1  2
Default mount options:    (none)
/dev/mapper/main-var /var ext4 
rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota
 0 0

...

Ah! it's turned on by journal_async_commit. OK, that alone argues
against use of journal_async_commit, tested or not, and I'd not have
turned it on if I'd noticed that.

(So, the combinations I'll be trying for effect on this bug are:

 journal_async_commit (as now)
 journal_checksum
 none

Technically to investigate all possibilities we should try
journal_async_commit,no_journal_checksum, but this seems so unlikely to
have ever been tested by anyone that it's not worth looking into...)

-- 
NULL && (void)


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Jim Rees
Theodore Ts'o wrote:

  The problem is this code isn't done yet, and journal_checksum is
  really not ready for prime time.  When it is ready, my plan is to wire
  it up so it is enabled by default; at the moment, it was intended for
  developer experimentation only.  As I said, it's my fault for not
  clearly labelling it "Not for you!", or putting it under an #ifdef to
  prevent unwary civilians from coming across the feature and saying,
  "oooh, shiny!" and turning it on.  :-(

Perhaps a word or two in the mount man page would be appropriate?


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Theodore Ts'o
On Fri, Oct 26, 2012 at 10:19:21PM +0100, Nix wrote:
> > prevent unwary civilians from coming across the feature and saying,
> > "oooh, shiny!" and turning it on.  :-(
> 
> Or having it turned on by default either, which seems to be the case
> now.

Huh?  It's not turned on by default.  If you mount with no mount
options, journal checksums are *not* turned on.

 - Ted


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Martin

On 10/26/2012 11:10 PM, Theodore Ts'o wrote:

> This looks very different.  The symptoms are quite different, and it's
> most likely that an unclean shutdown is involved.  In your case,
> you're doing clean shutdowns, with some suspend/resume cycles thrown
> in.

No no, the case I reported was triggered by an unclean shutdown: my son
hitting the power button after a system crash, or more likely when the
graphics subsystem became unresponsive.

> Are you running e2fsck to fix the file system consistency problems;
> what is e2fsck reporting?

By now it gives a clean bill of health. At first it reported issues,
the precise nature of which escapes my memory, and fixed them; after
the next reboot it reported some more issues, which again were fixed.
Had I known this would look similar to a prominent issue, I would have
paid more attention.

> Do you need to have a suspend/resume in order to trigger the problem?

No, I just mentioned the suspend/resume cycles to explain what is going
on in the syslog, which I didn't attach in the end. During the period
when the problem was building up there was no suspend/resume event.

> This could very well be some kind of hardware problem or kernel bug
> related to suspend/resume.  Unfortunately, many different problems get
> noticed by the file system, but the root cause can often be something
> else: a hardware problem, or a bug somewhere else in the kernel.

I hear what you are saying. I just want to add that the hardware has
survived the past two or three years despite suspend/resume and the odd
bit of abusive treatment (like unclean shutdowns by non-techie users).
I tend to keep the kernel, patches, modules and userland up to date.

> Regards,
>
> - Ted
>
> P.S.  Can you do us a favor and start a separate mail thread with the
> information reposted?  It can get hard to track different cases when
> a lot of people assume that their random failures (some of which are
> hardware problems) are related to the issue we are trying to track
> down in this mail thread and then they all pile onto the same mail
> thread or the same web forum --- one of the reasons why I detest
> Ubuntu Launchpad.  Thanks!!

Shall do.

cu Martin



Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Nix
On 26 Oct 2012, Theodore Ts'o uttered the following:

> The plan is that eventually, we will have checksums on a
> per-journalled block basis, instead of a per-commit basis, and when we
> get a failed checksum, we skip the replay of that block,

But not of everything it implies, since that's quite tricky to track
down (it's basically the same work needed for softupdates, but in
reverse). Hence the e2fsck check, I suppose.

> prevent unwary civilians from coming across the feature and saying,
> "oooh, shiny!" and turning it on.  :-(

Or having it turned on by default either, which seems to be the case
now.

-- 
NULL && (void)


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Theodore Ts'o
> This isn't the first time that journal_checksum has proven problematic.
> It's a shame that we're stuck between two error-inducing stools here...

The problem is that it currently bails out by aborting the entire
journal replay, and the file system will get left in a mess when it
does that.  It's actually safer today to just be blissfully ignorant
of a corrupted block in the journal, than to have the journal getting
aborted mid-replay when we detect a corrupted commit.

The plan is that eventually, we will have checksums on a
per-journalled block basis, instead of a per-commit basis, and when we
get a failed checksum, we skip the replay of that block, but we keep
going and replay all of the other blocks and commits.  We'll then set
the "file system corrupted" bit and force an e2fsck check.

The problem is this code isn't done yet, and journal_checksum is
really not ready for prime time.  When it is ready, my plan is to wire
it up so it is enabled by default; at the moment, it was intended for
developer experimentation only.  As I said, it's my fault for not
clearly labelling it "Not for you!", or putting it under an #ifdef to
prevent unwary civilians from coming across the feature and saying,
"oooh, shiny!" and turning it on.  :-(

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Theodore Ts'o
This looks very different.  The symptoms are quite different, and it's
most likely that an unclean shutdown is involved.  In your case,
you're doing clean shutdowns, with some suspend/resume cycles thrown
in.  Also, kernel version 3.5.5 doesn't have the commits that were
added between 3.6.1 and 3.6.3.

Are you running e2fsck to fix the file system consistency problems;
what is e2fsck reporting?

Do you need to have a suspend/resume in order to trigger the problem?

This could very well be some kind of hardware problem or kernel bug related
to suspend/resume.  Unfortunately, many different problems get noticed
by the file system, but the root cause can often be something else:
a hardware problem, or a bug somewhere else in the kernel.

Regards,

- Ted

P.S.  Can you do us a favor and start a separate mail thread with the
information reposted?  It can get hard to track different cases when
a lot of people assume that their random failures (some of which are
hardware problems) are related to the issue we are trying to track
down in this mail thread and then they all pile onto the same mail
thread or the same web forum --- one of the reasons why I detest
Ubuntu Launchpad.  Thanks!!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Nix
On 26 Oct 2012, Theodore Ts'o stated:

> On Fri, Oct 26, 2012 at 09:37:08PM +0100, Nix wrote:
>> 
>> I can reproduce this on a small filesystem and stick the image somewhere
>> if that would be of any use to anyone. (If I'm very lucky, merely making
>> this offer will make the problem go away. :} )
>
> I'm not sure the image is going to be that useful.  What we really
> need to do is to get a reliable reproduction of what _you_ are seeing.
>
> It's clear from Eric's experiments that journal_checksum is dangerous.
> 
> That's why one of the things I asked you to do when you had time was
> to see if you could reproduce the problem you are seeing w/o
> nobarrier,journal_checksum,journal_async_commit.

OK. Will do tomorrow.

> The other experiment that would be really useful if you could do is to
> try to apply these two patches which I sent earlier this week:
>
> [PATCH 1/2] ext4: revert "jbd2: don't write superblock when if its empty
> [PATCH 2/2] ext4: fix I/O error when unmounting an ro file system
>
> ... and see if they make a difference.

As of tomorrow I'll be able to reboot without causing a riot: I'll test
it then. (Sorry for the delay :( )

>   So I really don't want
> to push these patches to Linus until I get confirmation that they make
> a difference to *somebody*.

Agreed.

This isn't the first time that journal_checksum has proven problematic.
It's a shame that we're stuck between two error-inducing stools here...

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Theodore Ts'o
On Fri, Oct 26, 2012 at 09:37:08PM +0100, Nix wrote:
> 
> I can reproduce this on a small filesystem and stick the image somewhere
> if that would be of any use to anyone. (If I'm very lucky, merely making
> this offer will make the problem go away. :} )

I'm not sure the image is going to be that useful.  What we really
need to do is to get a reliable reproduction of what _you_ are seeing.

It's clear from Eric's experiments that journal_checksum is dangerous.
In fact, I will likely put it under an #ifdef EXT4_EXPERIMENTAL to try
to discourage people from using it in the future.  There are things
I've been planning on doing to make it be safer, but there's a very
good *reason* that both journal_checksum and journal_async_commit are
not on by default.

That's why one of the things I asked you to do when you had time was
to see if you could reproduce the problem you are seeing w/o
nobarrier,journal_checksum,journal_async_commit.

The other experiment that would be really useful if you could do is to
try to apply these two patches which I sent earlier this week:

[PATCH 1/2] ext4: revert "jbd2: don't write superblock when if its empty
[PATCH 2/2] ext4: fix I/O error when unmounting an ro file system

... and see if they make a difference.

If they don't make a difference, I don't want to apply patches just
for placebo/PR reasons.  And for Eric at least, he can reproduce the
journal checksum error followed by fairly significant corruption
reported by e2fsck with journal_checksum, and the presence or absence
of these patches makes no difference for him.  So I really don't want
to push these patches to Linus until I get confirmation that they make
a difference to *somebody*.

Regards,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Nix
On 26 Oct 2012, Martin said:

> On 10/26/2012 10:24 PM, Nix wrote:
>> On 26 Oct 2012, Martin spake thusly:
>>> Computer is booted again in order to copy a few files to memory stick. 
>>> Unbeknownst to me, the following entries are logged in the
>>> system log:
>>>
>>> Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5): 
>>> add_dirent_to_buf:1587: inode #655361: block 2629945: comm mount: bad
>>> entry in directory: rec_len % 4 != 0 - offset=360(360), inode=655682, 
>>> rec_len=18, name_len=5
>>> Oct 15 20:00:16 harold kernel: Aborting journal on device sda5-8.
>>> Oct 15 20:00:16 harold kernel: EXT4-fs (sda5): Remounting filesystem 
>>> read-only
>>> Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in 
>>> ext4_evict_inode:238: Journal has aborted
>>> Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in 
>>> ext4_create:2120: IO failure
>>
>> That's an interesting failure, but looks slightly different to what I
>> saw. No bad directory entries, no aborted journals: a replayed journal
>> and subsequent corruption. Still damaged though, and after a journal
>> abort I'm not surprised you had problems!
>
> So my corrupt journal is simply the result of a user turning off the machine 
> at a bad point in time? That's scary. In that scenario
> even the option data=journal wouldn't save me from harm, would it?

No, I think that's probably a bug -- but I don't know if it's the same
bug: the symptoms are slightly different.

(Note that some hard drives in the distant past had been known to write
rubbish if powered down during a write. I don't think this has been true
for a good decade or so, though.)

>> It's hard to reason about a kernel that's had *that* massive lump of
>> binary junk applied to it, alas. This may or may not be the same
>> problem: it has some common features with what I see, but not all.
>
> True, I normally re-create problems with vanilla kernels before
> reporting them. In this case I was cleanly sniped with no chance of a
> replay so far.

True. I'm stuck with a problem that I can only currently reproduce on
physical hardware myself :( In addition to seeing if Ted's proposed
patch reduces the frequency of corruption, I'll be doing some tests this
weekend with LVM block device suspension and subsequent reboots to see
if that causes similar symptoms even in virtualization.

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Martin

On 10/26/2012 10:24 PM, Nix wrote:

On 26 Oct 2012, Martin spake thusly:

[...]

I have studied my corruption problem more closely and can give you a
description of what happened below. Would you say this may be the same
bug?


No. You want to keep up with the thread. Ted's first educated guess is
not always guaranteed to be correct (though this is rare).


OK




Oct 15 19:56:12

Computer is booted again in order to copy a few files to memory stick. 
Unbeknownst to me, the following entries are logged in the
system log:

Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5): 
add_dirent_to_buf:1587: inode #655361: block 2629945: comm mount: bad
entry in directory: rec_len % 4 != 0 - offset=360(360), inode=655682, 
rec_len=18, name_len=5
Oct 15 20:00:16 harold kernel: Aborting journal on device sda5-8.
Oct 15 20:00:16 harold kernel: EXT4-fs (sda5): Remounting filesystem read-only
Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in 
ext4_evict_inode:238: Journal has aborted
Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in ext4_create:2120: 
IO failure


That's an interesting failure, but looks slightly different to what I
saw. No bad directory entries, no aborted journals: a replayed journal
and subsequent corruption. Still damaged though, and after a journal
abort I'm not surprised you had problems!


So my corrupt journal is simply the result of a user turning off the 
machine at a bad point in time? That's scary. In that scenario even the 
option data=journal wouldn't save me from harm, would it?


Funny this happens to someone who has always said that robustness is the 
most important quality of a filesystem (and who thinks data=writeback is 
madness).





   I will try to rename them to their
proper name on another machine, and restore them on the target
machine. However, due to the sheer number this might take forever.


I relearned this week that backups are good.


Backups are good, and always too old.




Also I am worried the problem might re-surface, as it has neither been
identified nor fixed.


I'm seeing it on almost every reboot.


Indeed the symptoms look different.




NB: kernel was v3.5.5


Hm, this provides possible evidence that the problem does indeed extend
into 3.5.x.


with CK1 and BFQ patches, tainted by nvidia module.


It's hard to reason about a kernel that's had *that* massive lump of
binary junk applied to it, alas. This may or may not be the same
problem: it has some common features with what I see, but not all.



True, I normally re-create problems with vanilla kernels before 
reporting them. In this case I was cleanly sniped with no chance of a 
replay so far.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Nix
On 26 Oct 2012, Eric Sandeen outgrape:

> On 10/23/12 3:57 PM, Nix wrote:
>> The only unusual thing about the filesystems on this machine are that
>> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
>> 'nobarrier': the full set of options for all my ext4 filesystems are:
>> 
>> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
>> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota
>
> Out of curiosity, when I test log replay with the journal_checksum option, I
> almost always get something like:
>
> [  999.917805] JBD2: journal transaction 84121 on dm-1-8 is corrupt.
> [  999.923904] EXT4-fs (dm-1): error loading journal
>
> after a simulated crash & log replay.
>
> Do you see anything like that in your logs?

I'm not seeing any corrupt journals or abort messages at all. The
journal claims to be fine, but plainly isn't.

I can reproduce this on a small filesystem and stick the image somewhere
if that would be of any use to anyone. (If I'm very lucky, merely making
this offer will make the problem go away. :} )

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Eric Sandeen
On 10/23/12 3:57 PM, Nix wrote:
> [Bruce, Trond, I fear it may be hard for me to continue chasing this NFS
>  lockd crash as long as ext4 on 3.6.3 is hosing my filesystems like
>  this. Apologies.]



> The only unusual thing about the filesystems on this machine are that
> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
> 'nobarrier': the full set of options for all my ext4 filesystems are:
> 
> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota

Out of curiosity, when I test log replay with the journal_checksum option, I
almost always get something like:

[  999.917805] JBD2: journal transaction 84121 on dm-1-8 is corrupt.
[  999.923904] EXT4-fs (dm-1): error loading journal

after a simulated crash & log replay.

Do you see anything like that in your logs?



Thanks,
-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Nix
On 26 Oct 2012, Martin spake thusly:

> On 10/24/2012 07:38 PM, Martin wrote:
>> On 10/24/2012 01:40 AM, Nix wrote:
>>
>>> It's true that in less than a week
>>> probably not all that many people have rebooted often enough to trip
>>> over this.
>>>
>>> I hope.
>>>
>>
>> [previous bug report]
>
> First off let me apologize for not having the right follow-up headers,
> but I am not subscribed and I read the list behind an NNTP gateway.
>
> I have studied my corruption problem more closely and can give you a
> description of what happened below. Would you say this may be the same
> bug?

No. You want to keep up with the thread. Ted's first educated guess is
not always guaranteed to be correct (though this is rare).

> Oct 15 19:56:12
>
> Computer is booted again in order to copy a few files to memory stick. 
> Unbeknownst to me, the following entries are logged in the
> system log:
>
> Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5): 
> add_dirent_to_buf:1587: inode #655361: block 2629945: comm mount: bad
> entry in directory: rec_len % 4 != 0 - offset=360(360), inode=655682, 
> rec_len=18, name_len=5
> Oct 15 20:00:16 harold kernel: Aborting journal on device sda5-8.
> Oct 15 20:00:16 harold kernel: EXT4-fs (sda5): Remounting filesystem read-only
> Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in 
> ext4_evict_inode:238: Journal has aborted
> Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in 
> ext4_create:2120: IO failure

That's an interesting failure, but looks slightly different to what I
saw. No bad directory entries, no aborted journals: a replayed journal
and subsequent corruption. Still damaged though, and after a journal
abort I'm not surprised you had problems!

>   I will try to rename them to their
> proper name on another machine, and restore them on the target
> machine. However, due to the sheer number this might take forever.

I relearned this week that backups are good.

> Also I am worried the problem might re-surface, as it has neither been
> identified nor fixed.

I'm seeing it on almost every reboot.

> NB: kernel was v3.5.5

Hm, this provides possible evidence that the problem does indeed extend
into 3.5.x.

> with CK1 and BFQ patches, tainted by nvidia module.

It's hard to reason about a kernel that's had *that* massive lump of
binary junk applied to it, alas. This may or may not be the same
problem: it has some common features with what I see, but not all.

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Martin

On 10/24/2012 07:38 PM, Martin wrote:

On 10/24/2012 01:40 AM, Nix wrote:


It's true that in less than a week
probably not all that many people have rebooted often enough to trip
over this.

I hope.



[previous bug report]


First off let me apologize for not having the right follow-up headers, 
but I am not subscribed and I read the list behind an NNTP gateway.


I have studied my corruption problem more closely and can give you a 
description of what happened below. Would you say this may be the same bug?


Thx and regards,

Martin

-- snip ---

Storyboard for my root filesystem crash (source: system logs and memory)


Oct 13 07:48:15

Computer is booted. Computer is then suspended and resumed a few times.


Oct 15 18:43:19

Final resume event before the issue starts. At some point prior to the 
next timestamp the computer freezes. Probably just the video hardware 
becoming unresponsive, but the teenage user does not know about ssh or 
alt-sysreq and decides to turn the killswitch.


He then remembers he is supposed to do a clean shutdown at all times and 
boots the computer again in order to perform a clean shutdown:



Oct 15 19:04:20

Computer is booted for the sole reason to be shut down again.


Oct 15 19:05:15

Computer halts. Nothing unusual in the system logs.


Oct 15 19:56:12

Computer is booted again in order to copy a few files to memory stick. 
Unbeknownst to me, the following entries are logged in the system log:


Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5): 
add_dirent_to_buf:1587: inode #655361: block 2629945: comm mount: bad 
entry in directory: rec_len % 4 != 0 - offset=360(360), inode=655682, 
rec_len=18, name_len=5

Oct 15 20:00:16 harold kernel: Aborting journal on device sda5-8.
Oct 15 20:00:16 harold kernel: EXT4-fs (sda5): Remounting filesystem 
read-only
Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in 
ext4_evict_inode:238: Journal has aborted
Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in 
ext4_create:2120: IO failure
Oct 15 20:00:16 harold hp-systray: hp-systray[1594]: warning: No hp: or 
hpfax: devices found in any installed CUPS queue. Exiting.
Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5): 
search_dirblock:1098: inode #655361: block 2629945: comm 
dbus-daemon-lau: bad entry in directory: rec_len % 4 != 0 - 
offset=360(8552), inode=655682, rec_len=18, name_len=5
Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5): 
search_dirblock:1098: inode #655361: block 2629945: comm 
dbus-daemon-lau: bad entry in directory: rec_len % 4 != 0 - 
offset=360(8552), inode=655682, rec_len=18, name_len=5

Oct 15 20:01:06 harold udevd[955]: specified group 'plugdev' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'netdev' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'tty' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'dialout' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'kmem' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'video' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'audio' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'lp' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'disk' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'floppy' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'cdrom' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'tape' unknown
Oct 15 20:01:07 harold kernel: sd 8:0:0:0: >[sdc] No Caching mode page 
present
Oct 15 20:01:07 harold kernel: sd 8:0:0:0: >[sdc] Assuming drive cache: 
write through
Oct 15 20:01:07 harold kernel: sd 8:0:0:0: >[sdc] No Caching mode page 
present
Oct 15 20:01:07 harold kernel: sd 8:0:0:0: >[sdc] Assuming drive cache: 
write through
Oct 15 20:01:07 harold kernel: sd 8:0:0:0: >[sdc] No Caching mode page 
present
Oct 15 20:01:07 harold kernel: sd 8:0:0:0: >[sdc] Assuming drive cache: 
write through
Oct 15 20:01:19 harold udisksd[1710]: Mounted /dev/sdc1 at 
/run/media/jan/INTENSO on behalf of uid 1002
Oct 15 20:01:21 harold kernel: EXT4-fs error (device sda5): 
htree_dirblock_to_tree:861: inode #655361: block 2629945: comm pool: bad 
entry in directory: rec_len % 4 != 0 - offset=360(8552), inode=655682, 
rec_len=18, name_len=5

Oct 15 20:01:59 harold udevd[955]: specified group 'plugdev' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'netdev' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'tty' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'dialout' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'kmem' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'video' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'audio' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'lp' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'disk' unknown
Oct 15 20:01:59 

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Eric Sandeen
On 10/24/12 3:17 PM, Jan Kara wrote:
> On Tue 23-10-12 19:57:09, Eric Sandeen wrote:
>> On 10/23/12 5:19 PM, Theodore Ts'o wrote:
>>> On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote:

 It is now quite clear that this is a bug introduced by one or more of
 the post-3.6.1 ext4 patches (which have all been backported at least to
 3.5, so the problem is probably there too).

 [   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: 
 group 202, 1583 clusters in bitmap, 1675 in gd
 [   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 
 0). There's a risk of filesystem corruption in case of system crash.

>>>
>>> I think I've found the problem.  I believe the commit at fault is commit
>>> 14b4ed22a6 (upstream commit eeecef0af5e):
>>>
>>> jbd2: don't write superblock when if its empty
>>>
>>> which first appeared in v3.6.2.
>>>
>>> The reason why the problem happens rarely is that the effect of the
>>> buggy commit is that if the journal's starting block is zero, we fail
>>> to truncate the journal when we unmount the file system.  This can
>>> happen if we mount and then unmount the file system fairly quickly,
>>> before the log has a chance to wrap.After the first time this has
>>> happened, it's not a disaster, since when we replay the journal, we'll
>>> just replay some extra transactions.  But if this happens twice, the
>>> oldest valid transaction will still not have gotten updated, but some
>>> of the newer transactions from the last mount session will have gotten
>>> written by the very latest transactions, and when we then try to do
>>> the extra transaction replays, the metadata blocks can end up getting
>>> very scrambled indeed.
>>
>> I'm stumped by this; maybe Ted can see if I'm missing something.
>>
>> (and Nix, is there anything special about your fs?  Any nondefault
>> mkfs or mount options, external journal, inordinately large fs, or
>> anything like that?)
>>
>> The suspect commit added this in jbd2_mark_journal_empty():
>>
>> /* Is it already empty? */
>> if (sb->s_start == 0) {
>> read_unlock(&journal->j_state_lock);
>> return;
>> }
>>
>> thereby short circuiting the function.
>>
>> But Ted's suggestion that mounting the fs, doing a little work, and
>> unmounting before we wrap would lead to this doesn't make sense to
>> me.  When I do a little work, s_start is at 1, not 0.  We start
>> the journal at s_first:
>>
>> load_superblock()
>>  journal->j_first = be32_to_cpu(sb->s_first);
>>
>> And when we wrap the journal, we wrap back to j_first:
>>
>> jbd2_journal_next_log_block():
>> if (journal->j_head == journal->j_last)
>> journal->j_head = journal->j_first;
>>
>> and j_first comes from s_first, which is set at journal creation
>> time to be "1" for an internal journal.
>>
>> So s_start == 0 sure looks special to me; so far I can only see that
>> we get there if we've been through jbd2_mark_journal_empty() already,
>> though I'm eyeballing jbd2_journal_get_log_tail() as well.
>>
>> Ted's proposed patch seems harmless but so far I don't understand
>> what problem it fixes, and I cannot recreate getting to
>> jbd2_mark_journal_empty() with a dirty log and s_start == 0.
>   Agreed. I rather think we might miss journal->j_flags |= JBD2_FLUSHED
> when shortcircuiting jbd2_mark_journal_empty(). But I still don't exactly
> see how that would cause the corruption...

Agreed, except so far I cannot see any way to get here with s_start == 0
without ALREADY having JBD2_FLUSHED set.  Can you?

Anyway, I think the problem is still poorly understood; lots of random facts
floating about, and a pretty weird usecase with nonstandard/dangerous mount
options.  I do want to figure out what regressed (if anything) but so far
this investigation doesn't seem very methodical.
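
One thing that is easy to check directly is what the on-disk journal
superblock says after an unmount. For an internal ext4 journal the
superblock sits in the first block of inode <8> (which debugfs can copy
out with its dump command), and s_first/s_sequence/s_start can be read
straight off that copy. A rough sketch, assuming the standard big-endian
jbd2 journal_superblock_t field layout (magic at offset 0, and
s_first/s_sequence/s_start at offsets 20/24/28):

#include <stdio.h>
#include <stdint.h>

static uint32_t be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

int main(int argc, char **argv)
{
	unsigned char buf[32];
	FILE *f;

	if (argc != 2 || !(f = fopen(argv[1], "rb"))) {
		fprintf(stderr, "usage: %s <journal-image>\n", argv[0]);
		return 1;
	}
	if (fread(buf, 1, sizeof(buf), f) != sizeof(buf)) {
		fprintf(stderr, "short read\n");
		fclose(f);
		return 1;
	}
	fclose(f);

	if (be32(buf) != 0xc03b3998) {		/* JBD2_MAGIC_NUMBER */
		fprintf(stderr, "no jbd2 magic, not a journal?\n");
		return 1;
	}
	printf("s_first    = %u\n", be32(buf + 20));
	printf("s_sequence = %u\n", be32(buf + 24));
	printf("s_start    = %u\n", be32(buf + 28));	/* 0 means "journal empty" */
	return 0;
}

(s_start is exactly the field the quoted short-circuit tests, so a
non-zero value seen here after a clean unmount would be a strong hint
that the superblock was not rewritten.)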

-Eric

>   Honza
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Nix
On 26 Oct 2012, Theodore Ts'o spake thusly:

> On Thu, Oct 25, 2012 at 08:11:12PM -0400, Ric Wheeler wrote:
>> 
>> Sending this just to you two to avoid embarrassing myself if I
>> misread the thread, but
>> 
>> Can we reproduce this with any other hardware RAID card? Or with MD?
>
> There was another user who reported very similar corruption using
> 3.6.2 using USB thumb drive.  I can't be certain that it's the same
> bug that's being triggered, but the symptoms were identical.

I now suspect it's the same bug, triggered in a different way, but also
by a block-layer problem -- instead of the block device driver not
blocking while the umount finishes (or throwing some of the data umount
writes away, whichever it is, not yet known), the block device goes away
because someone pulled it out of the USB socket. In any case, it appears
that an ext4 umount being interrupted while data is being written does
bad, bad things to the filesystem.

>> If we cannot reproduce this in other machines, why assume this is an
>> ext4 issue and not a hardware firmware bug?

A tad unlikely. Why would a firmware bug show up only at the instant of
reboot? Why would it show up as a lack of blocking on the kernel side? I
assure you that if you write lots of data to this controller normally,
you will end up blocking :) I can completely believe that it's an arcmsr
driver bug though. If it was an ext4 bug, it would surely be
reproducible in virtualization, or on different hardware, or something
like that.

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Nix
On 26 Oct 2012, Theodore Ts'o spake thusly:

 On Thu, Oct 25, 2012 at 08:11:12PM -0400, Ric Wheeler wrote:
 
 Sending this just to you two to avoid embarrassing myself if I
 misread the thread, but
 
 Can we reproduce this with any other hardware RAID card? Or with MD?

 There was another user who reported very similar corruption using
 3.6.2 using USB thumb drive.  I can't be certain that it's the same
 bug that's being triggered, but the symptoms were identical.

I now suspect it's the same bug, triggered in a different way, but also
by a block-layer problem -- instead of the block device driver not
blocking while the umount finishes (or throwing some of the data umount
writes away, whichever it is, not yet known), the block device goes away
because someone pulled it out of the USB socket. In any case, it appears
that an ext4 umount being interrupted while data is being written does
bad, bad things to the filesystem.

 If we cannot reproduce this in other machines, why assume this is an
 ext4 issue and not a hardware firmware bug?

A tad unlikely. Why would a firmware bug show up only at the instant of
reboot? Why would it show up as a lack of blocking on the kernel side? I
assure you that if you write lots of data to this controller normally,
you will end up blocking :) I can completely believe that it's an arcmsr
driver bug though. If it was an ext4 bug, it would surely be
reproducible in virtualization, or on different hardware, or something
like that.

-- 
NULL  (void)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Eric Sandeen
On 10/24/12 3:17 PM, Jan Kara wrote:
 On Tue 23-10-12 19:57:09, Eric Sandeen wrote:
 On 10/23/12 5:19 PM, Theodore Ts'o wrote:
 On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote:

 It is now quite clear that this is a bug introduced by one or more of
 the post-3.6.1 ext4 patches (which have all been backported at least to
 3.5, so the problem is probably there too).

 [   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: 
 group 202, 1583 clusters in bitmap, 1675 in gd
 [   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 
 0). There's a risk of filesystem corruption in case of system crash.


 I think I've found the problem.  I believe the commit at fault is commit
 14b4ed22a6 (upstream commit eeecef0af5e):

 jbd2: don't write superblock when if its empty

 which first appeared in v3.6.2.

 The reason why the problem happens rarely is that the effect of the
 buggy commit is that if the journal's starting block is zero, we fail
 to truncate the journal when we unmount the file system.  This can
 happen if we mount and then unmount the file system fairly quickly,
 before the log has a chance to wrap.After the first time this has
 happened, it's not a disaster, since when we replay the journal, we'll
 just replay some extra transactions.  But if this happens twice, the
 oldest valid transaction will still not have gotten updated, but some
 of the newer transactions from the last mount session will have gotten
 written by the very latest transacitons, and when we then try to do
 the extra transaction replays, the metadata blocks can end up getting
 very scrambled indeed.

 I'm stumped by this; maybe Ted can see if I'm missing something.

 (and Nix, is there anything special about your fs?  Any nondefault
 mkfs or mount options, external journal, inordinately large fs, or
 anything like that?)

 The suspect commit added this in jbd2_mark_journal_empty():

 /* Is it already empty? */
 if (sb-s_start == 0) {
 read_unlock(journal-j_state_lock);
 return;
 }

 thereby short circuiting the function.

 But Ted's suggestion that mounting the fs, doing a little work, and
 unmounting before we wrap would lead to this doesn't make sense to
 me.  When I do a little work, s_start is at 1, not 0.  We start
 the journal at s_first:

 load_superblock()
  journal-j_first = be32_to_cpu(sb-s_first);

 And when we wrap the journal, we wrap back to j_first:

 jbd2_journal_next_log_block():
 if (journal-j_head == journal-j_last)
 journal-j_head = journal-j_first;

 and j_first comes from s_first, which is set at journal creation
 time to be 1 for an internal journal.

 So s_start == 0 sure looks special to me; so far I can only see that
 we get there if we've been through jbd2_mark_journal_empty() already,
 though I'm eyeballing jbd2_journal_get_log_tail() as well.

 Ted's proposed patch seems harmless but so far I don't understand
 what problem it fixes, and I cannot recreate getting to
 jbd2_mark_journal_empty() with a dirty log and s_start == 0.
   Agreed. I rather thing we might miss journal-j_flags |= JBD2_FLUSHED
 when shortcircuiting jbd2_mark_journal_empty(). But I still don't exactly
 see how that would cause the corruption...

Agreed, except so far I cannot see any way to get here with s_start == 0
without ALREADY having JBD2_FLUSHED set.  Can you?

Anyway, I think the problem is still poorly understood; lots of random facts
floating about, and a pretty weird usecase with nonstandard/dangerous mount
options.  I do want to figure out what regressed (if anything) but so far
this investigation doesn't seem very methodical.

-Eric

   Honza
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Martin

On 10/24/2012 07:38 PM, Martin wrote:

On 10/24/2012 01:40 AM, Nix wrote:


It's true that in less than a week
probably not all that many people have rebooted often enough to trip
over this.

I hope.



[previous bug report]


First off let me apologize for not having the right follow-up headers, 
but I am not subscribed and I read the list behind an NNTP gateway.


I have studied my corruption problem more closely and can give you a 
description of what happened below. Would you say this may be the same bug?


Thx and regards,

Martin

-- snip ---

Storyboard for my root filesystem crash (source: system logs and memory)


Oct 13 07:48:15

Computer is booted. Computer is then suspended and resumed a few times.


Oct 15 18:43:19

Final resume event before the issue starts. At some point prior to the 
next timestamp the computer freezes. Probably just the video hardware 
becoming unresponsive, but the teenage user does not know about ssh or 
alt-sysreq and decides to turn the killswitch.


He then remembers he is supposed to do a clean shutdown at all times and 
boots the computer again in order to perform a clean shutdown:



Oct 15 19:04:20

Computer is booted for the sole reason to be shut down again.


Oct 15 19:05:15

Computer halts. Nothing unusual in the system logs.


Oct 15 19:56:12

Computer is booted again in order to copy a few files to memory stick. 
Unbeknownst to me, the following entries are logged in the system log:


Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5): 
add_dirent_to_buf:1587: inode #655361: block 2629945: comm mount: bad 
entry in directory: rec_len % 4 != 0 - offset=360(360), inode=655682, 
rec_len=18, name_len=5

Oct 15 20:00:16 harold kernel: Aborting journal on device sda5-8.
Oct 15 20:00:16 harold kernel: EXT4-fs (sda5): Remounting filesystem 
read-only
Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in 
ext4_evict_inode:238: Journal has aborted
Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in 
ext4_create:2120: IO failure
Oct 15 20:00:16 harold hp-systray: hp-systray[1594]: warning: No hp: or 
hpfax: devices found in any installed CUPS queue. Exiting.
Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5): 
search_dirblock:1098: inode #655361: block 2629945: comm 
dbus-daemon-lau: bad entry in directory: rec_len % 4 != 0 - 
offset=360(8552), inode=655682, rec_len=18, name_len=5
Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5): 
search_dirblock:1098: inode #655361: block 2629945: comm 
dbus-daemon-lau: bad entry in directory: rec_len % 4 != 0 - 
offset=360(8552), inode=655682, rec_len=18, name_len=5

Oct 15 20:01:06 harold udevd[955]: specified group 'plugdev' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'netdev' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'tty' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'dialout' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'kmem' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'video' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'audio' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'lp' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'disk' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'floppy' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'cdrom' unknown
Oct 15 20:01:06 harold udevd[955]: specified group 'tape' unknown
Oct 15 20:01:07 harold kernel: sd 8:0:0:0: [sdc] No Caching mode page 
present
Oct 15 20:01:07 harold kernel: sd 8:0:0:0: [sdc] Assuming drive cache: 
write through
Oct 15 20:01:07 harold kernel: sd 8:0:0:0: [sdc] No Caching mode page 
present
Oct 15 20:01:07 harold kernel: sd 8:0:0:0: [sdc] Assuming drive cache: 
write through
Oct 15 20:01:07 harold kernel: sd 8:0:0:0: [sdc] No Caching mode page 
present
Oct 15 20:01:07 harold kernel: sd 8:0:0:0: [sdc] Assuming drive cache: 
write through
Oct 15 20:01:19 harold udisksd[1710]: Mounted /dev/sdc1 at 
/run/media/jan/INTENSO on behalf of uid 1002
Oct 15 20:01:21 harold kernel: EXT4-fs error (device sda5): 
htree_dirblock_to_tree:861: inode #655361: block 2629945: comm pool: bad 
entry in directory: rec_len % 4 != 0 - offset=360(8552), inode=655682, 
rec_len=18, name_len=5

Oct 15 20:01:59 harold udevd[955]: specified group 'plugdev' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'netdev' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'tty' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'dialout' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'kmem' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'video' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'audio' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'lp' unknown
Oct 15 20:01:59 harold udevd[955]: specified group 'disk' unknown
Oct 15 20:01:59 

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Nix
On 26 Oct 2012, Martin spake thusly:

 On 10/24/2012 07:38 PM, Martin wrote:
 On 10/24/2012 01:40 AM, Nix wrote:

 It's true that in less than a week
 probably not all that many people have rebooted often enough to trip
 over this.

 I hope.


 [previous bug report]

 First off let me apologize for not having the right follow-up headers,
 but I am not subscribed and I read the list behind an NNTP gateway.

 I have studied my corruption problem more closely and can give you a
 description of what happened below. Would you say this may be the same
 bug?

No. You want to keep up with the thread. Ted's first educated guess is
not always guaranteed to be correct (though this is rare).

 Oct 15 19:56:12

 Computer is booted again in order to copy a few files to memory stick. 
 Unbeknownst to me, the following entries are logged in the
 system log:

 Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5): 
 add_dirent_to_buf:1587: inode #655361: block 2629945: comm mount: bad
 entry in directory: rec_len % 4 != 0 - offset=360(360), inode=655682, 
 rec_len=18, name_len=5
 Oct 15 20:00:16 harold kernel: Aborting journal on device sda5-8.
 Oct 15 20:00:16 harold kernel: EXT4-fs (sda5): Remounting filesystem read-only
 Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in 
 ext4_evict_inode:238: Journal has aborted
 Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in 
 ext4_create:2120: IO failure

That's an interesting failure, but looks slightly different to what I
saw. No bad directory entries, no aborted journals: a replayed journal
and subsequent corruption. Still damaged though, and after a journal
abort I'm not surprised you had problems!

   I will try to rename them to their
 proper name on another machine, and restore them on the target
 machine. However, due to the sheer number this might take forever.

I relearned this week that backups are good.

 Also I am worried the problem might re-surface, as it has neither been
 identified nor fixed.

I'm seeing it on almost every reboot.

 NB: kernel was v3.5.5

Hm, this provides possible evidence that the problem does indeed extend
into 3.5.x.

 with CK1 and BFQ patches, tainted by nvidia module.

It's hard to reason about a kernel that's had *that* massive lump of
binary junk applied to it, alas. This may or may not be the same
problem: it has some common features with what I see, but not all.

-- 
NULL  (void)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Eric Sandeen
On 10/23/12 3:57 PM, Nix wrote:
 [Bruce, Trond, I fear it may be hard for me to continue chasing this NFS
  lockd crash as long as ext4 on 3.6.3 is hosing my filesystems like
  this. Apologies.]

big snip

 The only unusual thing about the filesystems on this machine are that
 they have hardware RAID-5 (using the Areca driver), so I'm mounting with
 'nobarrier': the full set of options for all my ext4 filesystems are:
 
 rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
 usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota

Out of curiosity, when I test log replay with the journal_checksum option, I
almost always get something like:

[  999.917805] JBD2: journal transaction 84121 on dm-1-8 is corrupt.
[  999.923904] EXT4-fs (dm-1): error loading journal

after a simulated crash  log replay.

Do you see anything like that in your logs?

big snip

Thanks,
-Eric

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Nix
On 26 Oct 2012, Eric Sandeen outgrape:

 On 10/23/12 3:57 PM, Nix wrote:
 The only unusual thing about the filesystems on this machine are that
 they have hardware RAID-5 (using the Areca driver), so I'm mounting with
 'nobarrier': the full set of options for all my ext4 filesystems are:
 
 rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
 usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota

 Out of curiosity, when I test log replay with the journal_checksum option, I
 almost always get something like:

 [  999.917805] JBD2: journal transaction 84121 on dm-1-8 is corrupt.
 [  999.923904] EXT4-fs (dm-1): error loading journal

 after a simulated crash  log replay.

 Do you see anything like that in your logs?

I'm not seeing any corrupt journals or abort messages at all. The
journal claims to be fine, but plainly isn't.

I can reproduce this on a small filesystem and stick the image somewhere
if that would be of any use to anyone. (If I'm very lucky, merely making
this offer will make the problem go away. :} )

-- 
NULL  (void)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Martin

On 10/26/2012 10:24 PM, Nix wrote:

On 26 Oct 2012, Martin spake thusly:

[...]

I have studied my corruption problem more closely and can give you a
description of what happened below. Would you say this may be the same
bug?


No. You want to keep up with the thread. Ted's first educated guess is
not always guaranteed to be correct (though this is rare).


OK




Oct 15 19:56:12

Computer is booted again in order to copy a few files to memory stick. 
Unbeknownst to me, the following entries are logged in the
system log:

Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5): 
add_dirent_to_buf:1587: inode #655361: block 2629945: comm mount: bad
entry in directory: rec_len % 4 != 0 - offset=360(360), inode=655682, 
rec_len=18, name_len=5
Oct 15 20:00:16 harold kernel: Aborting journal on device sda5-8.
Oct 15 20:00:16 harold kernel: EXT4-fs (sda5): Remounting filesystem read-only
Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in 
ext4_evict_inode:238: Journal has aborted
Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in ext4_create:2120: 
IO failure


That's an interesting failure, but looks slightly different to what I
saw. No bad directory entries, no aborted journals: a replayed journal
and subsequent corruption. Still damaged though, and after a journal
abort I'm not surprised you had problems!


So my corrupt journal is simply the result of a user turning off the 
machine at a bad point in time? That's scary. In that scenario even the 
option data=journal wouldn't save me from harm, would it?


Funny this happens to someone who has always said that robustness is the 
most important quality of a filesystem (and who thinks data=writeback is 
madness).





   I will try to rename them to their
proper name on another machine, and restore them on the target
machine. However, due to the sheer number this might take forever.


I relearned this week that backups are good.


Backups are good, and always too old.




Also I am worried the problem might re-surface, as it has neither been
identified nor fixed.


I'm seeing it on almost every reboot.


Indeed the symptoms look different.




NB: kernel was v3.5.5


Hm, this provides possible evidence that the problem does indeed extend
into 3.5.x.


with CK1 and BFQ patches, tainted by nvidia module.


It's hard to reason about a kernel that's had *that* massive lump of
binary junk applied to it, alas. This may or may not be the same
problem: it has some common features with what I see, but not all.



true, i normally re-create problems with vanilla kernels before 
reporting them. In this case I was cleanly sniped with no chance of 
re-play so far.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Nix
On 26 Oct 2012, Martin said:

 On 10/26/2012 10:24 PM, Nix wrote:
 On 26 Oct 2012, Martin spake thusly:
 Computer is booted again in order to copy a few files to memory stick. 
 Unbeknownst to me, the following entries are logged in the
 system log:

 Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5): 
 add_dirent_to_buf:1587: inode #655361: block 2629945: comm mount: bad
 entry in directory: rec_len % 4 != 0 - offset=360(360), inode=655682, 
 rec_len=18, name_len=5
 Oct 15 20:00:16 harold kernel: Aborting journal on device sda5-8.
 Oct 15 20:00:16 harold kernel: EXT4-fs (sda5): Remounting filesystem 
 read-only
 Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in 
 ext4_evict_inode:238: Journal has aborted
 Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in 
 ext4_create:2120: IO failure

 That's an interesting failure, but looks slightly different to what I
 saw. No bad directory entries, no aborted journals: a replayed journal
 and subsequent corruption. Still damaged though, and after a journal
 abort I'm not surprised you had problems!

 So my corrupt journal is simply the result of a user turning off the machine 
 at a bad point in time? That's scary. In that scenario
 even the option data=journal wouldn't save me from harm, would it?

No, I think that's probably a bug -- but I don't know if it's the same
bug: the symptoms are slightly different.

(Note that some hard drives in the distant past had been known to write
rubbish if powered down during a write. I don't think this has been true
for a good decade or so, though.)

 It's hard to reason about a kernel that's had *that* massive lump of
 binary junk applied to it, alas. This may or may not be the same
 problem: it has some common features with what I see, but not all.

 true, i normally re-create problems with vanilla kernels before
 reporting them. In this case I was cleanly sniped with no chance of
 re-play so far.

True. I'm stuck with a problem that I can only currently reproduce on
physical hardware myself :( In addition to seeing if Ted's proposed
patch reduces the frequency of corruption, I'll be doing some tests this
weekend with LVM block device suspension and subsequent reboots to see
if that causes similar symptoms even in virtualization.

-- 
NULL  (void)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Theodore Ts'o
On Fri, Oct 26, 2012 at 09:37:08PM +0100, Nix wrote:
 
 I can reproduce this on a small filesystem and stick the image somewhere
 if that would be of any use to anyone. (If I'm very lucky, merely making
 this offer will make the problem go away. :} )

I'm not sure the image is going to be that useful.  What we really
need to do is to get a reliable reproduction of what _you_ are seeing.

It's clear from Eric's experiments that journal_checksum is dangerous.
In fact, I will likely put it under an #ifdef EXT4_EXPERIMENTAL to try
to discourage people from using it in the future.  There are things
I've been planning on doing to make it be safer, but there's a very
good *reason* that both journal_checksum and journal_async_commit are
not on by default.

That's why one of the things I asked you to do when you had time was
to see if you could reproduce the problem you are seeing w/o
nobarrier,journal_checksum,journal_async_commit.

The other experiment that would be really useful if you could do is to
try to apply these two patches which I sent earlier this week:

[PATCH 1/2] ext4: revert jbd2: don't write superblock when if its empty
[PATCH 2/2] ext4: fix I/O error when unmounting an ro file system

... and see if they make a difference.

If they don't make a difference, I don't want to apply patches just
for placebo/PR reasons.  And for Eric at least, he can reproduce the
journal checksum error followed by fairly significant corruption
reported by e2fsck with journal_checksum, and the presence or absense
of these patches make no difference for him.  So I really don't want
to push these patches to Linus until I get confirmation that they make
a difference to *somebody*.

Regards,

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Nix
On 26 Oct 2012, Theodore Ts'o stated:

 On Fri, Oct 26, 2012 at 09:37:08PM +0100, Nix wrote:
 
 I can reproduce this on a small filesystem and stick the image somewhere
 if that would be of any use to anyone. (If I'm very lucky, merely making
 this offer will make the problem go away. :} )

 I'm not sure the image is going to be that useful.  What we really
 need to do is to get a reliable reproduction of what _you_ are seeing.

 It's clear from Eric's experiments that journal_checksum is dangerous.
 
 That's why one of the things I asked you to do when you had time was
 to see if you could reproduce the problem you are seeing w/o
 nobarrier,journal_checksum,journal_async_commit.

OK. Will do tomorrow.

 The other experiment that would be really useful if you could do is to
 try to apply these two patches which I sent earlier this week:

 [PATCH 1/2] ext4: revert jbd2: don't write superblock when if its empty
 [PATCH 2/2] ext4: fix I/O error when unmounting an ro file system

 ... and see if they make a difference.

As of tomorrow I'll be able to reboot without causing a riot: I'll test
it then. (Sorry for the delay :( )

   So I really don't want
 to push these patches to Linus until I get confirmation that they make
 a difference to *somebody*.

Agreed.

This isn't the first time that journal_checksum has proven problematic.
It's a shame that we're stuck between two error-inducing stools here...

-- 
NULL  (void)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Theodore Ts'o
This looks very different.  The symptoms are quite different, and it's
most likely that an unclean shutdown is involved.  In your case,
you're doing clean shutdowns, with some suspend/resume cycles thrown
in.  Also, kernel version 3.5.5 doesn't have the commits that were
added between 3.6.1 and 3.6.3.

Are you running e2fsck to fix the file system consistency problems;
what is e2fsck reporting?

Do you need to have a suspend/resume in order to trigger the problem?

This could very be some kind of hardware problem or kernel bug related
to suspend/resume.  Unfortunately, many different problems get noticed
by the file system, but the root cause is can often be something else;
a hardware problem, or a bug somewhere else in the kernel.

Regards,

- Ted

P.S.  Can you do us a favor and start a separate mail thread with the
information reposted?  It's can get hard to track different cases when
a lot of people assume that their random failure (some of which are
hardware problems) are related to the issue we are trying to track
down in this mail thread and then they all pile onto the same mail
thread or the same web forum --- one of the reasons why I detest
Ubuntu Launchpad.  Thanks!!
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Theodore Ts'o
 This isn't the first time that journal_checksum has proven problematic.
 It's a shame that we're stuck between two error-inducing stools here...

The problem is that it currently bails out be aborting the entire
journal replay, and the file system will get left in a mess when it
does that.  It's actually safer today to just be blissfully ignorant
of a corrupted block in the journal, than to have the journal getting
aborted mid-replay when we detect a corrupted commit.

The plan is that eventually, we will have checksums on a
per-journalled block basis, instead of a per-commit basis, and when we
get a failed checksum, we skip the replay of that block, but we keep
going and replay all of the other blocks and commits.  We'll then set
the file system corrupted bit and force an e2fsck check.
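
In rough outline, the replay loop would then look something like this
(a sketch with made-up names, not the actual jbd2 code -- struct
replay_block and the helpers jbd2_chksum_of(), write_back_block() and
mark_fs_needing_fsck() are invented stand-ins for the real machinery):

	/* Replay one commit, skipping any block whose per-block checksum
	 * doesn't verify instead of aborting the whole recovery. */
	static int replay_one_commit(journal_t *journal,
				     struct replay_block *blk, int nr)
	{
		int i, bad = 0;

		for (i = 0; i < nr; i++) {
			if (jbd2_chksum_of(&blk[i]) != blk[i].csum) {
				bad = 1;	/* skip it, keep going */
				continue;
			}
			if (write_back_block(journal, &blk[i]))
				return -EIO;
		}
		if (bad)
			mark_fs_needing_fsck(journal);	/* error bit + e2fsck */
		return 0;
	}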

The problem is this code isn't done yet, and journal_checksum is
really not ready for prime time.  When it is ready, my plan is to wire
it up so it is enabled by default; at the moment, it was intended for
developer experimentation only.  As I said, it's my fault for not
clearly labelling it "Not for you!", or putting it under an #ifdef to
prevent unwary civilians from coming across the feature and saying,
"oooh, shiny!" and turning it on.  :-(

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Nix
On 26 Oct 2012, Theodore Ts'o uttered the following:

 The plan is that eventually, we will have checksums on a
 per-journalled block basis, instead of a per-commit basis, and when we
 get a failed checksum, we skip the replay of that block,

But not of everything it implies, since that's quite tricky to track
down (it's basically the same work needed for softupdates, but in
reverse). Hence the e2fsck check, I suppose.

 prevent unwary civilians from coming across the feature and saying,
 "oooh, shiny!" and turning it on.  :-(

Or having it turned on by default either, which seems to be the case
now.

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Martin

On 10/26/2012 11:10 PM, Theodore Ts'o wrote:

This looks very different.  The symptoms are quite different, and it's
most likely that an unclean shutdown is involved.  In your case,
you're doing clean shutdowns, with some suspend/resume cycles thrown
in.


No no, the case I reported was triggered by an unclean shutdown: my son 
hitting the power button after a system crash, or more likely when the 
graphics subsystem became unresponsive.



Are you running e2fsck to fix the file system consistency problems;
what is e2fsck reporting?


By now it attests a clean bill of health. At first it reported issues, 
the precise nature of which escapes my memory, fixed them, and after 
the next reboot reported some more issues, which again were fixed. Had I 
known this would look similar to a prominent issue I would have paid more 
attention.



Do you need to have a suspend/resume in order to trigger the problem?


no, I just mentioned the suspend/resume cycles to explain what is going 
on in the syslog, which I didn't attach in the end. During the period of 
the problem building up there was no suspend/resume event.



This could very well be some kind of hardware problem or a kernel bug
related to suspend/resume.  Unfortunately, many different problems get
noticed by the file system, but the root cause can often be something
else: a hardware problem, or a bug somewhere else in the kernel.


I hear what you are saying. I just want to add that the hardware has 
survived the past two or three years despite suspend/resume and the odd 
abusive treatment (like unclean shutdown by non-techie users). I tend to 
keep the kernel, patches, modules and user land up to date.




Regards,

- Ted

P.S.  Can you do us a favor and start a separate mail thread with the
information reposted?  It can get hard to track different cases when
a lot of people assume that their random failures (some of which are
hardware problems) are related to the issue we are trying to track
down in this mail thread and then they all pile onto the same mail
thread or the same web forum --- one of the reasons why I detest
Ubuntu Launchpad.  Thanks!!


Shall do.

cu Martin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Theodore Ts'o
On Fri, Oct 26, 2012 at 10:19:21PM +0100, Nix wrote:
  prevent unwary civilians from coming across the feature and saying,
  "oooh, shiny!" and turning it on.  :-(
 
 Or having it turned on by default either, which seems to be the case
 now.

Huh?  It's not turned on by default.  If you mount with no mount
options, journal checksums are *not* turned on.

 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Jim Rees
Theodore Ts'o wrote:

  The problem is this code isn't done yet, and journal_checksum is
  really not ready for prime time.  When it is ready, my plan is to wire
  it up so it is enabled by default; at the moment, it was intended for
  developer experimentation only.  As I said, it's my fault for not
  clearly labelling it "Not for you!", or putting it under an #ifdef to
  prevent unwary civilians from coming across the feature and saying,
  "oooh, shiny!" and turning it on.  :-(

Perhaps a word or two in the mount man page would be appropriate?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-25 Thread Theodore Ts'o
On Thu, Oct 25, 2012 at 08:11:12PM -0400, Ric Wheeler wrote:
> 
> Sending this just to you two to avoid embarrassing myself if I
> misread the thread, but
> 
> Can we reproduce this with any other hardware RAID card? Or with MD?

There was another user who reported very similar corruption using
3.6.2 on a USB thumb drive.  I can't be certain that it's the same
bug that's being triggered, but the symptoms were identical.

> If we cannot reproduce this in other machines, why assume this is an
> ext4 issue and not a hardware firmware bug?
> 
> As an ex-storage guy, this really smells like the hardware raid card
> might be misleading us

It's possible.  The main reason why I took this so seriously was
because of the 2nd, apparently confirming report, with very different
hardware.  That was what was so scary to me, at least at first.

  - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-25 Thread Ric Wheeler

On 10/24/2012 12:15 AM, Nix wrote:

On 24 Oct 2012, Eric Sandeen uttered the following:


On 10/23/12 3:57 PM, Nix wrote:

The only unusual thing about the filesystems on this machine are that
they have hardware RAID-5 (using the Areca driver), so I'm mounting with
'nobarrier':

I should have read more.  :(  More questions follow:

* Does the Areca have a battery backed write cache?

Yes (though I'm not powering off, just rebooting). Battery at 100% and
happy, though the lack of power-off means it's not actually getting
used, since the cache is obviously mains-backed as well.


Sending this just to you two to avoid embarrassing myself if I misread the 
thread, but


Can we reproduce this with any other hardware RAID card? Or with MD?

If we cannot reproduce this in other machines, why assume this is an ext4 issue 
and not a hardware firmware bug?


As an ex-storage guy, this really smells like the hardware raid card might be 
misleading us


ric



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-25 Thread Jan Kara
On Tue 23-10-12 19:57:09, Eric Sandeen wrote:
> On 10/23/12 5:19 PM, Theodore Ts'o wrote:
> > On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote:
> >>
> >> It is now quite clear that this is a bug introduced by one or more of
> >> the post-3.6.1 ext4 patches (which have all been backported at least to
> >> 3.5, so the problem is probably there too).
> >>
> >> [   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: 
> >> group 202, 1583 clusters in bitmap, 1675 in gd
> >> [   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 
> >> 0). There's a risk of filesystem corruption in case of system crash.
> >>
> > 
> > I think I've found the problem.  I believe the commit at fault is commit
> > 14b4ed22a6 (upstream commit eeecef0af5e):
> > 
> > jbd2: don't write superblock when if its empty
> > 
> > which first appeared in v3.6.2.
> > 
> > The reason why the problem happens rarely is that the effect of the
> > buggy commit is that if the journal's starting block is zero, we fail
> > to truncate the journal when we unmount the file system.  This can
> > happen if we mount and then unmount the file system fairly quickly,
> > before the log has a chance to wrap.  After the first time this has
> > happened, it's not a disaster, since when we replay the journal, we'll
> > just replay some extra transactions.  But if this happens twice, the
> > oldest valid transaction will still not have gotten updated, but some
> > of the newer transactions from the last mount session will have gotten
> > written by the very latest transactions, and when we then try to do
> > the extra transaction replays, the metadata blocks can end up getting
> > very scrambled indeed.
> 
> I'm stumped by this; maybe Ted can see if I'm missing something.
> 
> (and Nix, is there anything special about your fs?  Any nondefault
> mkfs or mount options, external journal, inordinately large fs, or
> anything like that?)
> 
> The suspect commit added this in jbd2_mark_journal_empty():
> 
> /* Is it already empty? */
> if (sb->s_start == 0) {
> read_unlock(&journal->j_state_lock);
> return;
> }
> 
> thereby short circuiting the function.
> 
> But Ted's suggestion that mounting the fs, doing a little work, and
> unmounting before we wrap would lead to this doesn't make sense to
> me.  When I do a little work, s_start is at 1, not 0.  We start
> the journal at s_first:
> 
> load_superblock()
>   journal->j_first = be32_to_cpu(sb->s_first);
> 
> And when we wrap the journal, we wrap back to j_first:
> 
> jbd2_journal_next_log_block():
> if (journal->j_head == journal->j_last)
> journal->j_head = journal->j_first;
> 
> and j_first comes from s_first, which is set at journal creation
> time to be "1" for an internal journal.
> 
> So s_start == 0 sure looks special to me; so far I can only see that
> we get there if we've been through jbd2_mark_journal_empty() already,
> though I'm eyeballing jbd2_journal_get_log_tail() as well.
> 
> Ted's proposed patch seems harmless but so far I don't understand
> what problem it fixes, and I cannot recreate getting to
> jbd2_mark_journal_empty() with a dirty log and s_start == 0.
  Agreed. I rather think we might miss journal->j_flags |= JBD2_FLUSHED
when shortcircuiting jbd2_mark_journal_empty(). But I still don't exactly
see how that would cause the corruption...
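
I.e. roughly something like this in the short circuit path (completely
untested, just to show where the flag would go; the write-lock dance
mirrors how the flag gets set further down in the function):

	/* Is it already empty? */
	if (sb->s_start == 0) {
		read_unlock(&journal->j_state_lock);
		/* don't lose JBD2_FLUSHED just because we short circuit */
		write_lock(&journal->j_state_lock);
		journal->j_flags |= JBD2_FLUSHED;
		write_unlock(&journal->j_state_lock);
		return;
	}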

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Nix
On 24 Oct 2012, Theodore Ts'o uttered the following:
> (Keep in mind this is why commercial software corporations like
> Microsoft or Apple generally don't make discussions as they are trying
> to root cause a problem public; sometimes the initial theories can be
> incorrect, and it's unfortunate when misinformation ends up on
> Phoronix or Slashdot, leading people to panic...  but this is open
> source, so that means we do everything in the open, since that way we
> can all work towards finding the best answer.)

Quite. The first few days of any problem diagnosis are often a process
of taking something from 'oh my god it might be the end of the world' to
'oh look it's really obscure, no wonder nobody has ever seen it before'.

This is quite *definitely* such a problem.

> It's a little bit too early for this meme:
>
> http://memegenerator.net/instance/28936247

It appears I have taken up a new post as the Iraqi Information Minister.
This is why I was disturbed to see the thing hitting Phoronix and then
Slashdot: as the guy whose FSes are being eaten, this is probably not an
easy bug to hit! If it hits, the consequences are serious, but it
doesn't seem to be easy to hit. (I should perhaps have phrased the
subject line better, but I'd just had my $HOME eaten and was rather
stressed out...)

> But do please note that Fedora 17 users have been using 3.6.2 for
> a while, so if this were an easily triggered bug, (a) Eric and I would
> have managed to reproduce it by now, and (b) lots of people would be
> complaining, since the symptoms of the bug are not subtle.

Quite.

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Jannis Achstetter
Am 24.10.2012 23:31, schrieb Theodore Ts'o:
> On Wed, Oct 24, 2012 at 09:13:01PM +0200, Jannis Achstetter wrote:
>>
>> As a "normal linux user" I'm interested in the practical things to do
>> now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
>> Fearing loss of data:
>> - Is there a way to see whether the journal of a specific partition has
>> been wrapped (since mounting) so that umounting and mounting (or doing a
>> reboot to downgrade the kernel) is safe?
> [...]
> (Keep in mind this is why commercial software corporations like
> [...]
> can all work towards finding the best answer.)

I really appreciate this and I like it since although the root-cause
hasn't been found for sure yet, it is a transparent process.
And it's a good thing that we can directly talk to the involved devs
w/o going through 200 layers of marketing and spokesmen (as it were with
the two companies you mentioned).

> It's a little bit too early for this meme:
> http://memegenerator.net/instance/28936247

That's a good one :)

> But do please note that Fedora 17 users have been using 3.6.2 for
> [...]
> with trailing edge kernel sources.  :-)

Yes, the downside of running Gentoo unstable. But even the "stable" tree
used 3.5.7 and this is the one my NAS uses where I do store my backups.
Nevertheless, your reply eased my mind to a great extent and I'm
thankful for it.
Time for bed now :)

Jannis

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Theodore Ts'o
On Wed, Oct 24, 2012 at 09:13:01PM +0200, Jannis Achstetter wrote:
> 
> As a "normal linux user" I'm interested in the practical things to do
> now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
> Fearing loss of data:
> - Is there a way to see whether the journal of a specific partition has
> been wrapped (since mounting) so that umounting and mounting (or doing a
> reboot to downgrade the kernel) is safe?

My initial analysis of what had been causing the problem now looks
incorrect (or at least incomplete).  Both Eric and I have been unable
to reproduce the failure based on my initial theory of what had been
going on.  So the best information at this point is that it's probably
not related to the file system getting unmounted before the journal
has wrapped.

(Keep in mind this is why commercial software corporations like
Microsoft or Apple generally don't make discussions as they are trying
to root cause a problem public; sometimes the initial theories can be
incorrect, and it's unfortunate when misinformation ends up on
Phoronix or Slashdot, leading people to panic...  but this is open
source, so that means we do everything in the open, since that way we
can all work towards finding the best answer.)

At the *moment* it looks like it might be related to an unclean
shutdown (i.e., a forced reset or power failure while the file system
is mounted or is in the process of being unmounted).  That being said,
a simple kill -9 of kvm running a test kernel while the file system is
mounted but otherwise quiescent doesn't trigger the problem (I was
trying that last night).

It's a little bit too early for this meme:

http://memegenerator.net/instance/28936247

But do please note that Fedora 17 users have been using 3.6.2 for
a while, so if this were an easily triggered bug, (a) Eric and I would
have managed to reproduce it by now, and (b) lots of people would be
complaining, since the symptoms of the bug are not subtle.

That's not to say we aren't treating this seriously; but people
shouldn't panic unduly (and if you are using a critical
enterprise/production server on bleeding edge kernels, may I suggest
that this might not be such a good idea; there is a *reason* why
enterprise Linux distros spend 6-9 months or more just stabilizing the
kernel, and being super paranoid about making changes afterwards for
years, and it's not because they enjoy backporting patches and working
with trailing edge kernel sources.  :-)

Regards,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Theodore Ts'o
On Wed, Oct 24, 2012 at 09:45:47PM +0100, Nix wrote:
> 
> It occurs to me that it is possible that this bug hits only those
> filesystems for which a umount has started but been unable to complete.
> If so, this is a relatively rare and unimportant bug which probably hits
> only me and users of slow removable filesystems in the whole world...

Can you verify this?  Does the bug show up if you just hit the power
switch while the system is booted?

How about changing the "sleep 2" to "sleep 0.5"?  (Feel free to
unmount your other partitions, and just leave a test file system
mounted to minimize the chances that you lose partitions that require
hours and hours to restore...)

If you can get a very reliable repro, we might have to ask you to try
the following experiments:

0) Make sure the reliable repro does _not_ work with 3.6.1 booted

1) Try a 3.6.2 kernel

2) (If the problem shows up above) try a 3.6.2 kernel with 14b4ed2 reverted

3) (If the problem shows up above) try a 3.6.2 kernel with all of ext4
   related patches reverted:
92b7722 ext4: fix mtime update in nodelalloc mode
34414b2 ext4: fix fdatasync() for files with only i_size changes
12ebdf0 ext4: always set i_op in ext4_mknod()
22a5672 ext4: online defrag is not supported for journaled files
ba57d9e ext4: move_extent code cleanup
2fdb112 ext4: fix crash when accessing /proc/mounts concurrently
1638f1f ext4: fix potential deadlock in ext4_nonda_switch()
5018ddd ext4: avoid duplicate writes of the backup bg descriptor blocks
256ae46 ext4: don't copy non-existent gdt blocks when resizing
416a688 ext4: ignore last group w/o enough space when resizing instead of 
BUG'ing
14b4ed2 jbd2: don't write superblock when if its empty

4) (If the problem still shows up) then we may need to do a full
   bisect to figure out what is going on

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Jannis Achstetter
Am 24.10.2012 00:19, schrieb Theodore Ts'o:
> The reason why the problem happens rarely is that the effect of the
> buggy commit is that if the journal's starting block is zero, we fail
> to truncate the journal when we unmount the file system.  This can
> happen if we mount and then unmount the file system fairly quickly,
> before the log has a chance to wrap.  After the first time this has
> happened, it's not a disaster, since when we replay the journal, we'll
> just replay some extra transactions.  But if this happens twice, the
> oldest valid transaction will still not have gotten updated, but some
> of the newer transactions from the last mount session will have gotten
> written by the very latest transactions, and when we then try to do
> the extra transaction replays, the metadata blocks can end up getting
> very scrambled indeed.

Repost. Sorry, I don't mean to spam, I just don't see my first mail
(sent via gmane.org) anywhere, so ...

As a "normal linux user" I'm interested in the practical things to do
now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
Fearing loss of data:
- Is there a way to see whether the journal of a specific partition has
been wrapped (since mounting) so that umounting and mounting (or doing a
reboot to downgrade the kernel) is safe?
- Is there a way to "force" a journal-wrap? Run any
filesystem-benchmark? Which one with what parameters? Or is it unwise
since I might even further corrupt data if I hit the case already?
- Is it wise to umount now and run e2fsck or might I corrupt my files
just by umounting now if the journal hasn't wrapped yet?
- How do you define "fairly quickly"? Of course servers run 24/7 but I
might be using my PC 2-5 hrs a day... Is that a "reboot too soon after
booting"?
- Any more advice you can give to the ordinary user to avoid
fs-corruption? Don't shut down machines for some days? Better down- or
upgrade the kernel?

Best regards,
Jannis Achstetter

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Nix
On 24 Oct 2012, n...@esperi.org.uk spake thusly:
> So, the net effect of this is that normally I get no journal recovery on
> anything at all -- but sometimes, if umounting takes longer than a few
> seconds, I reboot with not everything unmounted, and journal recovery
> kicks in on reboot.

It occurs to me that it is possible that this bug hits only those
filesystems for which a umount has started but been unable to complete.
If so, this is a relatively rare and unimportant bug which probably hits
only me and users of slow removable filesystems in the whole world...

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Nix
On 24 Oct 2012, Eric Sandeen uttered the following:

> On 10/24/2012 02:49 PM, Nix wrote:
>> On 24 Oct 2012, Theodore Ts'o spake thusly:
>>> Toralf, Nix, if you could try applying this patch (at the end of this
>>> message), and let me know how and when the WARN_ON triggers, and if it
>>> does, please send the empty_bug_workaround plus the WARN_ON(1) report.
>>> I know about the case where a file system is mounted and then
>>> immediately unmounted, but we don't think that's the problematic case.
>>> If you see any other cases where WARN_ON is triggering, it would be
>>> really good to know
>> 
>> Confirmed, it triggers. Traceback below.
>
> 
>
> The warn on triggers, but I can't tell - did the corruption still occur
> with Ted's patch?

Yes. I fscked the filesystems in 3.6.1 after rebooting: /var had a
journal replay, and the usual varieties of corruption (free space bitmap
problems and multiply-claimed blocks). (The other filesystems for which
the warning triggered had neither a journal replay nor corruption.
At least one of them, /home, likely had a few writes but not enough to
cause a journal wrap.)

I note that the warning may well *not* have triggered for /var: if the
reason it had a journal replay was simply that it was still in use by
something that hadn't died, the umount -l will have avoided doing a full
umount for that filesystem alone.

Also, the corrupted filesystem was mounted in 3.6.3 exactly once.
Multiple umounts are not necessary, but an unclean umount apparently is.

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Eric Sandeen
On 10/24/2012 02:49 PM, Nix wrote:
> On 24 Oct 2012, Theodore Ts'o spake thusly:
>> Toralf, Nix, if you could try applying this patch (at the end of this
>> message), and let me know how and when the WARN_ON triggers, and if it
>> does, please send the empty_bug_workaround plus the WARN_ON(1) report.
>> I know about the case where a file system is mounted and then
>> immediately unmounted, but we don't think that's the problematic case.
>> If you see any other cases where WARN_ON is triggering, it would be
>> really good to know
> 
> Confirmed, it triggers. Traceback below.
> 



The warn on triggers, but I can't tell - did the corruption still occur
with Ted's patch?

-Eric

> 
> OK. That umount of local filesystems sprayed your added
> empty bug workaround and WARN_ONs so many times that nearly all of them
> scrolled off the screen -- and because syslogd was dead by now and this
> is where my netconsole logs go, they're lost. I suspect every single
> umounted filesystem sprayed one of these (and this happened long before
> any reboot-before-we're-done).
> 
> But I did the old trick of camera-capturing the last one (which was
> probably /boot, which has never got corrupted because I hardly ever
> write anything to it at all). I hope it's more useful than nothing. (I
> can rearrange things to umount /var last, and try again, if you think
> that a specific warning from an fs known to get corrupted is especially
> likely to be valuable.)
> 
> So I see, for one umount at least (and the chunk of the previous one
> that scrolled offscreen is consistent with this):
> 
> jbd2_mark_journal_empty bug workaround (21218, 21219)
> [obscured by light] at fs/jbd2/journal.c:1364 jbd2_mark_journal_empty+06c/0xbd
> ...
> [addresses omitted for sanity: traceback only]
> warn_slowpath_common+0x83/0x9b
> warn_slowpath_null+0x1a/0x1c
> jbd2_mark_journal_empty+06c/0xbd
> jbd2_journal_destroy+0x183/0x20c
> ? abort_exclusive_wait+0x8e/0x8e
> ext4_put_super+0x6c/0x316
> ? evict_inodes+0xe6/0xf1
> generic_shutdown_super+0x59/0xd1
> ? free_vfsmnt+0x18/0x3c
> kill_block_super+0x27/0x6a
> deactivate_locked_super+0x26/0x57
> deactivate_super+0x3f/0x43
> mntput_no_expire+0x134/0x13c
> sys_umount+0x308/0x33a
> system_call_fastpath+0x16/0x1b

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Nix
On 24 Oct 2012, n...@esperi.org.uk uttered the following:
> So, the net effect of this is that normally I get no journal recovery on
> anything at all -- but sometimes, if umounting takes longer than a few
> seconds, I reboot with not everything unmounted, and journal recovery
> kicks in on reboot. My post-test fscks this time suggest that only when
> journal recovery kicks in after rebooting out of 3.6.3 do I see
> corruption. So this is indeed an unclean shutdown journal-replay
> situation: it just happens that I routinely have one or two fses
> uncleanly unmounted when all the rest are cleanly unmounted. This
> perhaps explains the scattershot nature of the corruption I see, and why
> most of my ext4 filesystems get off scot-free.

Note that two umounts are not required: fsck found corruption on /var
after a single boot+shutdown round in 3.6.3+this patch. (It did do a
journal replay on /var first.)

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Nix
On 24 Oct 2012, Theodore Ts'o spake thusly:
> Toralf, Nix, if you could try applying this patch (at the end of this
> message), and let me know how and when the WARN_ON triggers, and if it
> does, please send the empty_bug_workaround plus the WARN_ON(1) report.
> I know about the case where a file system is mounted and then
> immediately unmounted, but we don't think that's the problematic case.
> If you see any other cases where WARN_ON is triggering, it would be
> really good to know

Confirmed, it triggers. Traceback below.


But first, a rather lengthy apology: I did indeed forget something
unusual about my system. In my defence, this is a change I made to my
shutdown scripts many years ago, when umount -l was first introduced
(early 2000s? something like that). So it's not surprising I forgot
about it until I needed to add sleeps to it to capture the tracebacks
below. It is really ugly. You may need a sick bag. In brief: some of my
filesystems will sometimes be uncleanly unmounted and experience journal
replay even on clean shutdowns, and which it is will vary unpredictably.


Some of my machines have fairly intricate webs of NFS-mounted and
non-NFS-mounted filesystems, and I expect them all to reboot
successfully if commanded remotely, because sometimes I'm hundreds of
miles away when I do it and can hardly hit the reset button.

Unfortunately, if I have a mount structure like this:

/usr local
/usr/foo NFS-mounted (may be loopback-NFS-mounted)
/usr/foo/bar local

and /usr/foo is down, any attempt to umount /usr/foo/bar will hang
indefinitely. Worse yet, if I umount the nfs filesystem, the local fs
isn't going to be reachable either -- but umounting nfs filesystems has
to happen first so I can killall everything (which would include e.g.
rpc.statd and rpc.nfsd) in order to free up the local filesystems for
umount.

The only way I could see to fix this is to umount -l everything rather
than umounting it (sure, I could do some sort of NFS-versus-non-NFS
analysis and only do this to some filesystems, but testing this
complexity for the -- for me -- rare case of system shutdown was too
annoying to consider). I consider a hang on shutdown much worse than an
occasional unclean umount, because all my filesystems are journalled so
journal recovery will make everything quite happy.

So I do

sync
umount -a -l -t nfs & sleep 2
killall5 -15
killall5 -9
exportfs -ua
quotaoff -a
swapoff -a
LANG=C sort -r -k 2 /proc/mounts | \
(DIRS=""
 while read DEV DIR TYPE REST; do
 case "$DIR" in
 /|/proc|/dev|/proc/*|/sys)
 continue;; # Ignoring virtual file systems needed later
 esac

 case $TYPE in
 proc|procfs|sysfs|usbfs|usbdevfs|devpts)
 continue;; # Ignoring non-tmpfs virtual file systems
 esac
 DIRS="$DIRS $DIR"
done
umount -l -r -d $DIRS) # rely on mount's toposort
sleep 2

The net effect of this being to cleanly umount everything whose mount
points are reachable and which unmounts cleanly in less than a couple of
seconds, and to leave the rest mounted and let journal recovery handle
them. This is clearly really horrible -- I'd far prefer to say 'sleep
until filesystems have finished doing I/O' or better have mount just not
return from mount(8) unless that is true. But this isn't available, and
even it was some fses would still be left to journal recovery, so I
kludged it -- and then forgot about doing anything to improve the
situation for many years.

So, the net effect of this is that normally I get no journal recovery on
anything at all -- but sometimes, if umounting takes longer than a few
seconds, I reboot with not everything unmounted, and journal recovery
kicks in on reboot. My post-test fscks this time suggest that only when
journal recovery kicks in after rebooting out of 3.6.3 do I see
corruption. So this is indeed an unclean shutdown journal-replay
situation: it just happens that I routinely have one or two fses
uncleanly unmounted when all the rest are cleanly unmounted. This
perhaps explains the scattershot nature of the corruption I see, and why
most of my ext4 filesystems get off scot-free.

I'll wait for a minute until you're finished projectile-vomiting. (And
if you have suggestions for making the case of nested local/remote
filesystems work without rebooting while umounts may still be in
progress, or even better suggestions to allow me to umount mounts that
happen to be mounted below NFS-mounted mounts with dead or nonresponsive
NFS server, I'd be glad to hear them! Distros appear to take the
opposite tack, and prefer to simply lock up forever waiting for a
nonresponsive NFS server in this situation. I could never accept that.)


[...]

OK. That umount of local filesystems sprayed your added
empty bug workaround and WARN_ONs so many times that nearly all of them
scrolled off the screen -- and because syslogd was dead by now and this
is where my netconsole logs go, they're lost. I suspect every single
umounted filesystem 

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Jannis Achstetter
Am 24.10.2012 00:19, schrieb Theodore Ts'o:
> [...]
> The reason why the problem happens rarely is that the effect of the
> buggy commit is that if the journal's starting block is zero, we fail
> to truncate the journal when we unmount the file system.  This can
> happen if we mount and then unmount the file system fairly quickly,
> before the log has a chance to wrap.  After the first time this has
> happened, it's not a disaster, since when we replay the journal, we'll
> just replay some extra transactions.  But if this happens twice, the
> oldest valid transaction will still not have gotten updated, but some
> of the newer transactions from the last mount session will have gotten
> written by the very latest transactions, and when we then try to do
> the extra transaction replays, the metadata blocks can end up getting
> very scrambled indeed.
> [...]

As a "normal linux user" I'm interested in the practical things to do
now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
Fearing loss of data:
- Is there a way to see whether the journal of a specific partition has
been wrapped (since mounting) so that umounting and mounting (or doing a
reboot to downgrade the kernel) is safe?
- Is there a way to "force" a journal-wrap? Run any
filesystem-benchmark? Which one with what parameters? Or is it unwise
since I might even further corrupt data if I hit the case already?
- Is it wise to umount now and run e2fsck or might I corrupt my files
just by umounting now if the journal hasn't wrapped yet?
- How do you define "fairly quickly"? Of course servers run 24/7 but I
might be using my PC 2-5 hrs a day... Is that a "reboot too soon after
booting"?
- Any more advice you can give to the ordinary user to avoid
fs-corruption? Don't shut down machines for some days? Better down- or
upgrade the kernel?

Best regards,
Jannis Achstetter


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Martin

On 10/24/2012 01:40 AM, Nix wrote:


It's true that in less than a week
probably not all that many people have rebooted often enough to trip
over this.

I hope.



Fwiw, I got a fried root filesystem (ext4) on one machine last week. It 
was on 3.5.3 or 3.5.5. Since there was nothing in the logs and the 
kernel was modified (CK, BFQ) and tainted (nvidia) I did not notify any 
maintainers. I have not had the time yet to rebuild the machine 
(unfortunately that will be laboursome), so the users cannot do their 
homework or attend to their social life for the time being...


The pattern was indeed characterized by a sequence of reboots (I am 
told), and in a weird fashion files started to disappear from the root 
filesystem: I first noticed /etc/groups missing, and after further fscks 
and reboots login became impossible (I assume that /etc/passwd and/or 
/etc/shadow are buggered). I haven't assessed the extent of the damage yet.


Still not sure whether it is related to the bug in question, of course.

Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Eric Sandeen
On 10/24/2012 12:23 AM, Theodore Ts'o wrote:
> On Tue, Oct 23, 2012 at 11:27:09PM -0500, Eric Sandeen wrote:
>>
>> Ok, fair enough.  If the BBU is working, nobarrier is ok; I don't trust
>> journal_async_commit, but that doesn't mean this isn't a regression.
> 
> Note that Toralf has reported almost exactly the same set of symptoms,
> but he's using an external USB stick --- and as far as I know he
> wasn't using nobarrier and/or the journal_async_commit.  Toralf, can
> you confirm what, if any, mount options you were using when you saw
> it.
> 
> I've been looking at this some more, and there's one other thing that
> the short circuit code does, which is that it neglects setting the
> JBD2_FLUSHED flag, which is used by the commit code to know when it
> needs to reset the s_start fields in the superblock when we make our
> next commit.  However, this would only happen if the short circuit
> code is getting hit some time other than when the file system is
> getting unmounted --- and that's what Eric and I can't figure out how
> it might be happening.  Journal flushes outside of an unmount do
> happen as part of online resizing, the FIBMAP ioctl, or when the file
> system is frozen.  But it didn't sound like Toralf or Nix was using
> any of those features.  (Toralf, Nix, please correct me if my
> assumptions here are wrong).

If I freeze w/ anything in the log, then s_start !=0 and we proceed
normally.  If I re-freeze w/o anything in the log, it's already set to
FLUSHED (which makes sense) so not re-setting it doesn't matter.  So I
don't see that that's an issue.

As for FIBMAP I think we only do journal_flush if it's data=journal.
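
(For reference, driving the freeze path from userspace only needs a
FIFREEZE/FITHAW ioctl pair on the mount point -- something like this
quick-and-dirty tester, with the default mount point made up purely
for illustration:)

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	int main(int argc, char **argv)
	{
		const char *mnt = argc > 1 ? argv[1] : "/mnt/test";
		int fd = open(mnt, O_RDONLY);

		if (fd < 0) { perror("open"); return 1; }
		if (ioctl(fd, FIFREEZE, 0) < 0) { perror("FIFREEZE"); return 1; }
		/* the journal gets flushed and marked empty at this point */
		if (ioctl(fd, FITHAW, 0) < 0) { perror("FITHAW"); return 1; }
		close(fd);
		return 0;
	}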

In other news, Phoronix is on the case, so expect escalating freakouts ;)

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Nix
On 24 Oct 2012, Hugh Dickins verbalised:

> On Wed, 24 Oct 2012, Theodore Ts'o wrote:
>> Journal flushes outside of an unmount do
>> happen as part of online resizing, the FIBMAP ioctl, or when the file
>> system is frozen.  But it didn't sound like Toralf or Nix was using
>> any of those features.  (Toralf, Nix, please correct me if my
>> assumptions here is wrong).
>
> I believe it also happens at swapon of a swapfile on the filesystem.

I'm not using swapfiles, only swap partitions (on separate LVM LVs).
So that's not it either.

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Nix
On 24 Oct 2012, Theodore Ts'o stated:

> Journal flushes outside of an unmount do
> happen as part of online resizing, the FIBMAP ioctl, or when the file
> system is frozen.  But it didn't sound like Toralf or Nix was using
> any of those features.

Quite so -- the corrupted filesystems have space reserved for resizing,
and one of them has been resized, years ago, but I haven't resized
either of them with this kernel, or with any kernel numbered 3.x for
that matter.

> Toralf, Nix, if you could try applying this patch (at the end of this
> message), and let me know how and when the WARN_ON triggers, and if it
> does, please send the empty_bug_workaround plus the WARN_ON(1) report.
> I know about the case where a file system is mounted and then
> immediately unmounted, but we don't think that's the problematic case.
> If you see any other cases where WARN_ON is triggering, it would be
> really good to know

I'll give it a test later today, after another backup has finished.
Daily backups are normally overkill, but I don't think they are right
now.

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Hugh Dickins
On Wed, 24 Oct 2012, Theodore Ts'o wrote:
> Journal flushes outside of an unmount do
> happen as part of online resizing, the FIBMAP ioctl, or when the file
> system is frozen.  But it didn't sound like Toralf or Nix was using
> any of those features.  (Toralf, Nix, please correct me if my
> assumptions here is wrong).

I believe it also happens at swapon of a swapfile on the filesystem.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Hugh Dickins
On Wed, 24 Oct 2012, Theodore Ts'o wrote:
 Journal flushes outside of an unmount does
 happen as part of online resizing, the FIBMAP ioctl, or when the file
 system is frozen.  But it didn't sound like Toralf or Nix was using
 any of those features.  (Toralf, Nix, please correct me if my
 assumptions here is wrong).

I believe it also happens at swapon of a swapfile on the filesystem.

Hugh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Nix
On 24 Oct 2012, Theodore Ts'o stated:

 Journal flushes outside of an unmount does
 happen as part of online resizing, the FIBMAP ioctl, or when the file
 system is frozen.  But it didn't sound like Toralf or Nix was using
 any of those features.

Quite so -- the corrupted filesystems have space reserved for resizing,
and one of them has been resized, years ago, but I haven't resized
either of them with this kernel, or with any kernel numbered 3.x for
that matter.

 Toralf, Nix, if you could try applying this patch (at the end of this
 message), and let me know how and when the WARN_ON triggers, and if it
 does, please send the empty_bug_workaround plus the WARN_ON(1) report.
 I know about the case where a file system is mounted and then
 immediately unmounted, but we don't think that's the problematic case.
 If you see any other cases where WARN_ON is triggering, it would be
 really good to know

I'll give it a test later today, after another backup has finished.
Daily backups are normally overkill, but I don't think they are right
now.

-- 
NULL  (void)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Nix
On 24 Oct 2012, Hugh Dickins verbalised:

 On Wed, 24 Oct 2012, Theodore Ts'o wrote:
 Journal flushes outside of an unmount does
 happen as part of online resizing, the FIBMAP ioctl, or when the file
 system is frozen.  But it didn't sound like Toralf or Nix was using
 any of those features.  (Toralf, Nix, please correct me if my
 assumptions here is wrong).

 I believe it also happens at swapon of a swapfile on the filesystem.

I'm not using swapfiles, only swap partitions (on separate LVM LVs).
So that's not it either.

-- 
NULL  (void)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Eric Sandeen
On 10/24/2012 12:23 AM, Theodore Ts'o wrote:
 On Tue, Oct 23, 2012 at 11:27:09PM -0500, Eric Sandeen wrote:

 Ok, fair enough.  If the BBU is working, nobarrier is ok; I don't trust
 journal_async_commit, but that doesn't mean this isn't a regression.
 
 Note that Toralf has reported almost exactly the same set of symptoms,
 but he's using an external USB stick --- and as far as I know he
 wasn't using nobarrier and/or the journal_async_commit.  Toralf, can
 you confirm what, if any, mount options you were using when you saw
 it.
 
 I've been looking at this some more, and there's one other thing that
 the short circuit code does, which is neglects setting the
 JBD2_FLUSHED flag, which is used by the commit code to know when it
 needs to reset the s_start fields in the superblock when we make our
 next commit.  However, this would only happen if the short circuit
 code is getting hit some time other than when the file system is
 getting unmounted --- and that's what Eric and I can't figure out how
 it might be happening.  Journal flushes outside of an unmount does
 happen as part of online resizing, the FIBMAP ioctl, or when the file
 system is frozen.  But it didn't sound like Toralf or Nix was using
 any of those features.  (Toralf, Nix, please correct me if my
 assumptions here is wrong).

If I freeze w/ anything in the log, then s_start !=0 and we proceed
normally.  If I re-freeze w/o anything in the log, it's already set to
FLUSHED (which makes sense) so not re-setting it doesn't matter.  So I
don't see that that's an issue.

As for FIBMAP I think we only do journal_flush if it's data=journal.

In other news, Phoronix is on the case, so expect escalating freakouts ;)

-Eric
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Martin

On 10/24/2012 01:40 AM, Nix wrote:


It's true that in less than a week
probably not all that many people have rebooted often enough to trip
over this.

I hope.



Fwiw, i got a fried root filesystem (ext4) on one machine last week. It 
was on 3.5.3 or 3.5.5. Since there was nothing in the logs and the 
kernel was modified (CK, BFQ) and tainted (nvidia) I did not notify any 
maintainers. I have not had the time yet to rebuild the machine 
(unfortunately that will be laboursome), so the users cannot do their 
homework or attend to their social life for the time being...


The pattern was indeed characterized by a sequence of reboots (I am 
told), and in a weird fashion files started to disappear from the root 
filesystem (I first noticed /etc/groups missing, and after further fscks 
and reboots login became impossible (I assume that /etc/passwd and/or 
/etc/shadow are buggered)). I haven't assessed the extent of the damage yet.


Still not sure whether it is related to the bug in question, of course.

Martin
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Jannis Achstetter
On 24.10.2012 00:19, Theodore Ts'o wrote:
 [...]
 The reason why the problem happens rarely is that the effect of the
 buggy commit is that if the journal's starting block is zero, we fail
 to truncate the journal when we unmount the file system.  This can
 happen if we mount and then unmount the file system fairly quickly,
 before the log has a chance to wrap.  After the first time this has
 happened, it's not a disaster, since when we replay the journal, we'll
 just replay some extra transactions.  But if this happens twice, the
 oldest valid transaction will still not have gotten updated, but some
 of the newer transactions from the last mount session will have gotten
 overwritten by the very latest transactions, and when we then try to do
 the extra transaction replays, the metadata blocks can end up getting
 very scrambled indeed.
 [...]

As a normal linux user I'm interested in the practical things to do
now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
Fearing loss of data:
- Is there a way to see whether the journal of a specific partition has
been wrapped (since mounting) so that umounting and mounting (or doing a
reboot to downgrade the kernel) is safe?
- Is there a way to force a journal-wrap? Run any
filesystem-benchmark? Which one with what parameters? Or is it unwise
since I might even further corrupt data if I hit the case already?
- Is it wise to umount now and run e2fsck or might I corrupt my files
just by umounting now if the journal hasn't wrapped yet?
- How do you define "fairly quickly"? Of course servers run 24/7 but I
might be using my PC 2-5 hrs a day... Is that a reboot too soon after
booting?
- Any more advice you can give to the ordinary user to avoid
fs-corruption? Don't shut down machines for some days? Better down- or
upgrade the kernel?

Best regards,
Jannis Achstetter


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Nix
On 24 Oct 2012, Theodore Ts'o spake thusly:
 Toralf, Nix, if you could try applying this patch (at the end of this
 message), and let me know how and when the WARN_ON triggers, and if it
 does, please send the empty_bug_workaround plus the WARN_ON(1) report.
 I know about the case where a file system is mounted and then
 immediately unmounted, but we don't think that's the problematic case.
 If you see any other cases where WARN_ON is triggering, it would be
 really good to know

Confirmed, it triggers. Traceback below.


But first, a rather lengthy apology: I did indeed forget something
unusual about my system. In my defence, this is a change I made to my
shutdown scripts many years ago, when umount -l was first introduced
(early 2000s? something like that). So it's not surprising I forgot
about it until I needed to add sleeps to it to capture the tracebacks
below. It is really ugly. You may need a sick bag. In brief: some of my
filesystems will sometimes be uncleanly unmounted and experience journal
replay even on clean shutdowns, and which it is will vary unpredictably.


Some of my machines have fairly intricate webs of NFS-mounted and
non-NFS-mounted filesystems, and I expect them all to reboot
successfully if commanded remotely, because sometimes I'm hundreds of
miles away when I do it and can hardly hit the reset button.

Unfortunately, if I have a mount structure like this:

/usr local
/usr/foo NFS-mounted (may be loopback-NFS-mounted)
/usr/foo/bar local

and /usr/foo is down, any attempt to umount /usr/foo/bar will hang
indefinitely. Worse yet, if I umount the nfs filesystem, the local fs
isn't going to be reachable either -- but umounting nfs filesystems has
to happen first so I can killall everything (which would include e.g.
rpc.statd and rpc.nfsd) in order to free up the local filesystems for
umount.

The only way I could see to fix this is to umount -l everything rather
than umounting it (sure, I could do some sort of NFS-versus-non-NFS
analysis and only do this to some filesystems, but testing this
complexity for the -- for me -- rare case of system shutdown was too
annoying to consider). I consider a hang on shutdown much worse than an
occasional unclean umount, because all my filesystems are journalled so
journal recovery will make everything quite happy.

So I do

sync
umount -a -l -t nfs && sleep 2
killall5 -15
killall5 -9
exportfs -ua
quotaoff -a
swapoff -a
LANG=C sort -r -k 2 /proc/mounts | \
(DIRS=
 while read DEV DIR TYPE REST; do
 case $DIR in
 /|/proc|/dev|/proc/*|/sys)
 continue;; # Ignoring virtual file systems needed later
 esac

 case $TYPE in
 proc|procfs|sysfs|usbfs|usbdevfs|devpts)
 continue;; # Ignoring non-tmpfs virtual file systems
 esac
 DIRS="$DIRS $DIR"
done
umount -l -r -d $DIRS) # rely on mount's toposort
sleep 2

The net effect of this being to cleanly umount everything whose mount
points are reachable and which unmounts cleanly in less than a couple of
seconds, and to leave the rest mounted and let journal recovery handle
them. This is clearly really horrible -- I'd far prefer to say 'sleep
until filesystems have finished doing I/O' or better have umount just not
return from umount(8) unless that is true. But this isn't available, and
even it was some fses would still be left to journal recovery, so I
kludged it -- and then forgot about doing anything to improve the
situation for many years.

So, the net effect of this is that normally I get no journal recovery on
anything at all -- but sometimes, if umounting takes longer than a few
seconds, I reboot with not everything unmounted, and journal recovery
kicks in on reboot. My post-test fscks this time suggest that only when
journal recovery kicks in after rebooting out of 3.6.3 do I see
corruption. So this is indeed an unclean shutdown journal-replay
situation: it just happens that I routinely have one or two fses
uncleanly unmounted when all the rest are cleanly unmounted. This
perhaps explains the scattershot nature of the corruption I see, and why
most of my ext4 filesystems get off scot-free.

I'll wait for a minute until you're finished projectile-vomiting. (And
if you have suggestions for making the case of nested local/remote
filesystems work without rebooting while umounts may still be in
progress, or even better suggestions to allow me to umount mounts that
happen to be mounted below NFS-mounted mounts with dead or nonresponsive
NFS server, I'd be glad to hear them! Distros appear to take the
opposite tack, and prefer to simply lock up forever waiting for a
nonresponsive NFS server in this situation. I could never accept that.)


[...]

OK. That umount of local filesystems sprayed your added
empty bug workaround and WARN_ONs so many times that nearly all of them
scrolled off the screen -- and because syslogd was dead by now and this
is where my netconsole logs go, they're lost. I suspect every single
umounted filesystem sprayed one of 

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Nix
On 24 Oct 2012, n...@esperi.org.uk uttered the following:
 So, the net effect of this is that normally I get no journal recovery on
 anything at all -- but sometimes, if umounting takes longer than a few
 seconds, I reboot with not everything unmounted, and journal recovery
 kicks in on reboot. My post-test fscks this time suggest that only when
 journal recovery kicks in after rebooting out of 3.6.3 do I see
 corruption. So this is indeed an unclean shutdown journal-replay
 situation: it just happens that I routinely have one or two fses
 uncleanly unmounted when all the rest are cleanly unmounted. This
 perhaps explains the scattershot nature of the corruption I see, and why
 most of my ext4 filesystems get off scot-free.

Note that two umounts are not required: fsck found corruption on /var
after a single boot+shutdown round in 3.6.3+this patch. (It did do a
journal replay on /var first.)

-- 
NULL && (void)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Eric Sandeen
On 10/24/2012 02:49 PM, Nix wrote:
 On 24 Oct 2012, Theodore Ts'o spake thusly:
 Toralf, Nix, if you could try applying this patch (at the end of this
 message), and let me know how and when the WARN_ON triggers, and if it
 does, please send the empty_bug_workaround plus the WARN_ON(1) report.
 I know about the case where a file system is mounted and then
 immediately unmounted, but we don't think that's the problematic case.
 If you see any other cases where WARN_ON is triggering, it would be
 really good to know
 
 Confirmed, it triggers. Traceback below.
 

giant snip

The warn on triggers, but I can't tell - did the corruption still occur
with Ted's patch?

-Eric

 
 OK. That umount of local filesystems sprayed your added
 empty bug workaround and WARN_ONs so many times that nearly all of them
 scrolled off the screen -- and because syslogd was dead by now and this
 is where my netconsole logs go, they're lost. I suspect every single
 umounted filesystem sprayed one of these (and this happened long before
 any reboot-before-we're-done).
 
 But I did the old trick of camera-capturing the last one (which was
 probably /boot, which has never got corrupted because I hardly ever
 write anything to it at all). I hope it's more useful than nothing. (I
 can rearrange things to umount /var last, and try again, if you think
 that a specific warning from an fs known to get corrupted is especially
 likely to be valuable.)
 
 So I see, for one umount at least (and the chunk of the previous one
 that scrolled offscreen is consistent with this):
 
 jbd2_mark_journal_empty bug workaround (21218, 21219)
 [obscured by light] at fs/jbd2/journal.c:1364 jbd2_mark_journal_empty+0x6c/0xbd
 ...
 [addresses omitted for sanity: traceback only]
 warn_slowpath_common+0x83/0x9b
 warn_slowpath_null+0x1a/0x1c
 jbd2_mark_journal_empty+0x6c/0xbd
 jbd2_journal_destroy+0x183/0x20c
 ? abort_exclusive_wait+0x8e/0x8e
 ext4_put_super+0x6c/0x316
 ? evict_inodes+0xe6/0xf1
 generic_shutdown_super+0x59/0xd1
 ? free_vfsmnt+0x18/0x3c
 kill_block_super+0x27/0x6a
 deactivate_locked_super+0x26/0x57
 deactivate_super+0x3f/0x43
 mntput_no_expire+0x134/0x13c
 sys_umount+0x308/0x33a
 system_call_fastpath+0x16/0x1b

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Nix
On 24 Oct 2012, Eric Sandeen uttered the following:

 On 10/24/2012 02:49 PM, Nix wrote:
 On 24 Oct 2012, Theodore Ts'o spake thusly:
 Toralf, Nix, if you could try applying this patch (at the end of this
 message), and let me know how and when the WARN_ON triggers, and if it
 does, please send the empty_bug_workaround plus the WARN_ON(1) report.
 I know about the case where a file system is mounted and then
 immediately unmounted, but we don't think that's the problematic case.
 If you see any other cases where WARN_ON is triggering, it would be
 really good to know
 
 Confirmed, it triggers. Traceback below.

 giant snip

 The warn on triggers, but I can't tell - did the corruption still occur
 with Ted's patch?

Yes. I fscked the filesystems in 3.6.1 after rebooting: /var had a
journal replay, and the usual varieties of corruption (free space bitmap
problems and multiply-claimed blocks). (The other filesystems for which
the warning triggered had neither a journal replay nor corruption.
At least one of them, /home, likely had a few writes but not enough to
cause a journal wrap.)

I note that the warning may well *not* have triggered for /var: if the
reason it had a journal replay was simply that it was still in use by
something that hadn't died, the umount -l will have avoided doing a full
umount for that filesystem alone.

Also, the corrupted filesystem was mounted in 3.6.3 exactly once.
Multiple umounts are not necessary, but an unclean umount apparently is.

-- 
NULL && (void)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Nix
On 24 Oct 2012, n...@esperi.org.uk spake thusly:
 So, the net effect of this is that normally I get no journal recovery on
 anything at all -- but sometimes, if umounting takes longer than a few
 seconds, I reboot with not everything unmounted, and journal recovery
 kicks in on reboot.

It occurs to me that it is possible that this bug hits only those
filesystems for which a umount has started but been unable to complete.
If so, this is a relatively rare and unimportant bug which probably hits
only me and users of slow removable filesystems in the whole world...

-- 
NULL && (void)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Jannis Achstetter
On 24.10.2012 00:19, Theodore Ts'o wrote:
 The reason why the problem happens rarely is that the effect of the
 buggy commit is that if the journal's starting block is zero, we fail
 to truncate the journal when we unmount the file system.  This can
 happen if we mount and then unmount the file system fairly quickly,
 before the log has a chance to wrap.  After the first time this has
 happened, it's not a disaster, since when we replay the journal, we'll
 just replay some extra transactions.  But if this happens twice, the
 oldest valid transaction will still not have gotten updated, but some
 of the newer transactions from the last mount session will have gotten
 overwritten by the very latest transactions, and when we then try to do
 the extra transaction replays, the metadata blocks can end up getting
 very scrambled indeed.

Repost. Sorry, I don't mean to spam, I just don't see my first mail
(sent via gmane.org) anywhere, so ...

As a normal linux user I'm interested in the practical things to do
now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
Fearing loss of data:
- Is there a way to see whether the journal of a specific partition has
been wrapped (since mounting) so that umounting and mounting (or doing a
reboot to downgrade the kernel) is safe?
- Is there a way to force a journal-wrap? Run any
filesystem-benchmark? Which one with what parameters? Or is it unwise
since I might even further corrupt data if I hit the case already?
- Is it wise to umount now and run e2fsck or might I corrupt my files
just by umounting now if the journal hasn't wrapped yet?
- How do you define "fairly quickly"? Of course servers run 24/7 but I
might be using my PC 2-5 hrs a day... Is that a reboot too soon after
booting?
- Any more advice you can give to the ordinary user to avoid
fs-corruption? Don't shut down machines for some days? Better down- or
upgrade the kernel?

Best regards,
Jannis Achstetter

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Theodore Ts'o
On Wed, Oct 24, 2012 at 09:45:47PM +0100, Nix wrote:
 
 It occurs to me that it is possible that this bug hits only those
 filesystems for which a umount has started but been unable to complete.
 If so, this is a relatively rare and unimportant bug which probably hits
 only me and users of slow removable filesystems in the whole world...

Can you verify this?  Does the bug show up if you just hit the power
switch while the system is booted?

How about changing the sleep 2 to sleep 0.5?  (Feel free to
unmount your other partitions, and just leave a test file system
mounted to minimize the chances that you lose partitions that require
hours and hours to restore...)

If you can get a very reliable repro, we might have to ask you to try
the following experiments:

0) Make sure the reliable repro does _not_ work with 3.6.1 booted

1) Try a 3.6.2 kernel

2) (If the problem shows up above) try a 3.6.2 kernel with 14b4ed2 reverted

3) (If the problem shows up above) try a 3.6.2 kernel with all of ext4
   related patches reverted:
92b7722 ext4: fix mtime update in nodelalloc mode
34414b2 ext4: fix fdatasync() for files with only i_size changes
12ebdf0 ext4: always set i_op in ext4_mknod()
22a5672 ext4: online defrag is not supported for journaled files
ba57d9e ext4: move_extent code cleanup
2fdb112 ext4: fix crash when accessing /proc/mounts concurrently
1638f1f ext4: fix potential deadlock in ext4_nonda_switch()
5018ddd ext4: avoid duplicate writes of the backup bg descriptor blocks
256ae46 ext4: don't copy non-existent gdt blocks when resizing
416a688 ext4: ignore last group w/o enough space when resizing instead of 
BUG'ing
14b4ed2 jbd2: don't write superblock when if its empty

4) (If the problem still shows up) then we may need to do a full
   bisect to figure out what is going on

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Theodore Ts'o
On Wed, Oct 24, 2012 at 09:13:01PM +0200, Jannis Achstetter wrote:
 
 As a normal linux user I'm interested in the practical things to do
 now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
 Fearing loss of data:
 - Is there a way to see whether the journal of a specific partition has
 been wrapped (since mounting) so that umounting and mounting (or doing a
 reboot to downgrade the kernel) is safe?

My initial analysis of what had been causing the problem now looks
incorrect (or at least incomplete).  Both Eric and I have been unable
to reproduce the failure based on my initial theory of what had been
going on.  So the best information at this point is that it's probably
not related to the file system getting unmounted before the journal
has wrapped.

(Keep in mind this is why commercial software corporations like
Microsoft or Apple generally don't make their discussions public while they
are trying to root cause a problem; sometimes the initial theories can be
incorrect, and it's unfortunate when misinformation ends up on
Phoronix or Slashdot, leading people to panic...  but this is open
source, so that means we do everything in the open, since that way we
can all work towards finding the best answer.)

At the *moment* it looks like it might be related to an unclean
shutdown (i.e., a forced reset or power failure while the file system
is mounted or is in the process of being unmounted).  That being said,
a simple kill -9 of kvm running a test kernel while the file system is
mounted but otherwise quiescent doesn't trigger the problem (I was
trying that last night).

It's a little bit too early for this meme:

http://memegenerator.net/instance/28936247

But do please note that Fedora 17 users have been using 3.6.2 for
a while, so if this were an easily triggered bug, (a) Eric and I would
have managed to reproduce it by now, and (b) lots of people would be
complaining, since the symptoms of the bug are not subtle.

That's not to say we aren't treating this seriously; but people
shouldn't panic unduly (and if you are using a critical
enterprise/production server on bleeding edge kernels, may I suggest
that this might not be such a good idea; there is a *reason* why
enterprise Linux distros spend 6-9 months or more just stabilizing the
kernel, and being super paranoid about making changes afterwards for
years, and it's not because they enjoy backporting patches and working
with trailing edge kernel sources.  :-)

Regards,

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Jannis Achstetter
On 24.10.2012 23:31, Theodore Ts'o wrote:
 On Wed, Oct 24, 2012 at 09:13:01PM +0200, Jannis Achstetter wrote:

 As a normal linux user I'm interested in the practical things to do
 now to avoid data loss. I'm running several systems with 3.6.2 and ext4.
 Fearing loss of data:
 - Is there a way to see whether the journal of a specific partition has
 been wrapped (since mounting) so that umounting and mounting (or doing a
 reboot to downgrade the kernel) is safe?
 [...]
 (Keep in mind this is why commercial software corporations like
 [...]
 can all work towards finding the best answer.)

I really appreciate this and I like it since although the root-cause
hasn't been found for sure yet, it is a transparent process.
And it's a good thing that we can directly talk to the involved devs
w/o going through 200 layers of marketing and spokesmen (as it were with
the two companies you mentioned).

 It's a little bit too early for this meme:
 http://memegenerator.net/instance/28936247

That's a good one :)

 But do please note that Fedora 17 users have been using 3.6.2 for
 [...]
 with trailing edge kernel sources.  :-)

Yes, the downside of running Gentoo unstable. But even the stable tree
used 3.5.7 and this is the one my NAS uses where I do store my backups.
Nevertheless, your reply eased my mind to a great extent and I'm
thankful for it.
Time for bed now :)

Jannis

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Nix
On 24 Oct 2012, Theodore Ts'o uttered the following:
 (Keep in mind this is why commercial software corporations like
 Microsoft or Apple generally don't make their discussions public while they
 are trying to root cause a problem; sometimes the initial theories can be
 incorrect, and it's unfortunate when misinformation ends up on
 Phoronix or Slashdot, leading people to panic...  but this is open
 source, so that means we do everything in the open, since that way we
 can all work towards finding the best answer.)

Quite. The first few days of any problem diagnosis are often a process
of taking something from 'oh my god it might be the end of the world' to
'oh look it's really obscure, no wonder nobody has ever seen it before'.

This is quite *definitely* such a problem.

 It's a little bit too early for this meme:

 http://memegenerator.net/instance/28936247

It appears I have taken up a new post as the Iraqi Information Minister.
This is why I was disturbed to see the thing hitting Phoronix and then
Slashdot: as the guy whose FSes are being eaten, this is probably not an
easy bug to hit! If it hits, the consequences are serious, but it
doesn't seem to be easy to hit. (I should perhaps have phrased the
subject line better, but I'd just had my $HOME eaten and was rather
stressed out...)

 But do please note that Fedora 17 users have been using 3.6.2 for
 a while, so if this were an easily triggered bug, (a) Eric and I would
 have managed to reproduce it by now, and (b) lots of people would be
 complaining, since the symptoms of the bug are not subtle.

Quite.

-- 
NULL && (void)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-23 Thread Theodore Ts'o
On Tue, Oct 23, 2012 at 11:27:09PM -0500, Eric Sandeen wrote:
> 
> Ok, fair enough.  If the BBU is working, nobarrier is ok; I don't trust
> journal_async_commit, but that doesn't mean this isn't a regression.

Note that Toralf has reported almost exactly the same set of symptoms,
but he's using an external USB stick --- and as far as I know he
wasn't using nobarrier and/or the journal_async_commit.  Toralf, can
you confirm what, if any, mount options you were using when you saw
it.

I've been looking at this some more, and there's one other thing that
the short circuit code does, which is that it neglects setting the
JBD2_FLUSHED flag, which is used by the commit code to know when it
needs to reset the s_start fields in the superblock when we make our
next commit.  However, this would only happen if the short circuit
code is getting hit some time other than when the file system is
getting unmounted --- and that's what Eric and I can't figure out how
it might be happening.  Journal flushes outside of an unmount do
happen as part of online resizing, the FIBMAP ioctl, or when the file
system is frozen.  But it didn't sound like Toralf or Nix was using
any of those features.  (Toralf, Nix, please correct me if my
assumptions here are wrong).
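
(For anyone who wants to exercise the freeze path by hand while testing: it
can be driven from userspace with the FIFREEZE/FITHAW ioctls, which is what
fsfreeze(8) uses. A minimal sketch follows; /mnt/test is just a placeholder
for whatever scratch file system is mounted, and it needs CAP_SYS_ADMIN.)

/* freeze_thaw.c: freeze and immediately thaw a mounted file system,
 * which forces a jbd2 journal flush on ext4.  /mnt/test is a placeholder. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FIFREEZE, FITHAW */

int main(void)
{
	int fd = open("/mnt/test", O_RDONLY);	/* open the mount point itself */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, FIFREEZE, 0) < 0) {	/* blocks writes, flushes the journal */
		perror("FIFREEZE");
	} else if (ioctl(fd, FITHAW, 0) < 0) {
		perror("FITHAW");
	}
	close(fd);
	return 0;
}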

So here's a replacement patch which essentially restores the effects
of eeecef0af5e while still keeping the optimization and fixing the
read/only testing issue which eeecef0af5e is trying to fix up.  It
also have a debugging printk that will trigger so we can perhaps have
a better chance of figuring out what might be going on.

Toralf, Nix, if you could try applying this patch (at the end of this
message), and let me know how and when the WARN_ON triggers, and if it
does, please send the empty_bug_workaround plus the WARN_ON(1) report.
I know about the case where a file system is mounted and then
immediately unmounted, but we don't think that's the problematic case.
If you see any other cases where WARN_ON is triggering, it would be
really good to know

  - Ted

P.S.  This is a list of all of the commits between v3.6.1 and v3.6.2
(there were no ext4-related changes between v3.6.2 and v3.6.3), and a
quick analysis of the patch.  The last commit, 14b4ed2, is the only
one that I could see as potentially being problematic, which is why
I've been pushing so hard on this one even though my original analysis
doesn't seem to be correct, and Eric and I can't see how the change in
14b4ed2 could be causing the fs corruption.


Online Defrag
=
22a5672 ext4: online defrag is not supported for journaled files
ba57d9e ext4: move_extent code cleanup
   No behavioral change unless e4defrag has been used.

Online Resize
=
5018ddd ext4: avoid duplicate writes of the backup bg descriptor blocks
256ae46 ext4: don't copy non-existent gdt blocks when resizing
416a688 ext4: ignore last group w/o enough space when resizing instead of 
BUG'ing
   No observable change unless online resizing (e2resize) has been used

Other Commits
=
92b7722 ext4: fix mtime update in nodelalloc mode
   Changes where we call file_update_time()

34414b2 ext4: fix fdatasync() for files with only i_size changes
   Forces the inode changes to be committed if only i_size changes when
   fdatasync() is called.  No changes except performance impact
   to fdatasync() and correctness after a system crash.

12ebdf0 ext4: always set i_op in ext4_mknod()
   Fixes a bug if CONFIG_EXT4_FS_XATTR is not defined;
   no change if CONFIG_EXT4_FS_XATTR is defined

2fdb112 ext4: fix crash when accessing /proc/mounts concurrently
   Removes an erroneous "static" from a local variable so it is allocated on the stack;
   fixes a bug if two processes cat /proc/mounts at the same time

1638f1f ext4: fix potential deadlock in ext4_nonda_switch()
   Fixes a circular lock dependency

14b4ed2 jbd2: don't write superblock when if its empty
   If journal->s_start is zero, we may not update journal->s_sequence when
   it might be needed.  (But at the moment we can't see how this could
   lead to the reported fs corruptions.)


commit cb57108637e01ec2f02d9311cedc3013e96f25d4
Author: Theodore Ts'o 
Date:   Wed Oct 24 01:01:41 2012 -0400

jbd2: fix a potential fs corrupting bug in jbd2_mark_journal_empty

Fix a potential file system corrupting bug which was introduced by
commit eeecef0af5ea4efd763c9554cf2bd80fc4a0efd3: jbd2: don't write
superblock when if its empty.

We should only skip writing the journal superblock if there is nothing
to do --- not just when s_start is zero.

This has caused users to report file system corruptions in ext4 that
look like this:

EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 
clusters in bitmap, 22901 in gd
JBD2: Spotted dirty metadata buffer (dev = sdb3, blocknr = 0). There's a 
risk of filesystem corruption in case of system crash.

after the file system has been corrupted.

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-23 Thread Eric Sandeen
On 10/23/12 11:15 PM, Nix wrote:
> On 24 Oct 2012, Eric Sandeen uttered the following:
> 
>> On 10/23/12 3:57 PM, Nix wrote:
>>> The only unusual thing about the filesystems on this machine are that
>>> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
>>> 'nobarrier': 
>>
>> I should have read more.  :(  More questions follow:
>>
>> * Does the Areca have a battery backed write cache?
> 
> Yes (though I'm not powering off, just rebooting). Battery at 100% and
> happy, though the lack of power-off means it's not actually getting
> used, since the cache is obviously mains-backed as well.
> 
>> * Are you crashing or rebooting cleanly?
> 
> Rebooting cleanly, everything umounted happily including /home and /var.
> 
>> * Do you see log recovery messages in the logs for this filesystem?
> 
> My memory says yes, but nothing seems to be logged when this happens
> (though with my logs on the first filesystem damaged by this, this is
> rather hard to tell, they're all quite full of NULs by now).
> 
> I'll double-reboot tomorrow via the faulty kernel and check, unless I
> get asked not to in the interim. (And then double-reboot again to fsck
> everything...)
> 
>>> the full set of options for all my ext4 filesystems are:
>>>
>>> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
>>> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota
>>
>> ok journal_async_commit is off the reservation a bit; that's really not
>> tested, and Jan had serious reservations about its safety.
> 
> OK, well, I've been 'testing' it for years :) No problems until now. (If
> anything, I was more concerned about journal_checksum. I thought that
> had actually been implicated in corruption before now...)

It had, but I fixed it AFAIK; OTOH, we turned it off by default
after that episode.

>> * Can you reproduce this w/o journal_async_commit?
> 
> I can try!

Ok, fair enough.  If the BBU is working, nobarrier is ok; I don't trust
journal_async_commit, but that doesn't mean this isn't a regression.

Thanks for the answers... onward.  :)

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-23 Thread Nix
On 24 Oct 2012, Eric Sandeen uttered the following:

> On 10/23/12 3:57 PM, Nix wrote:
>> The only unusual thing about the filesystems on this machine are that
>> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
>> 'nobarrier': 
>
> I should have read more.  :(  More questions follow:
>
> * Does the Areca have a battery backed write cache?

Yes (though I'm not powering off, just rebooting). Battery at 100% and
happy, though the lack of power-off means it's not actually getting
used, since the cache is obviously mains-backed as well.

> * Are you crashing or rebooting cleanly?

Rebooting cleanly, everything umounted happily including /home and /var.

> * Do you see log recovery messages in the logs for this filesystem?

My memory says yes, but nothing seems to be logged when this happens
(though with my logs on the first filesystem damaged by this, this is
rather hard to tell, they're all quite full of NULs by now).

I'll double-reboot tomorrow via the faulty kernel and check, unless I
get asked not to in the interim. (And then double-reboot again to fsck
everything...)

>> the full set of options for all my ext4 filesystems are:
>> 
>> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
>> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota
>
> ok journal_async_commit is off the reservation a bit; that's really not
> tested, and Jan had serious reservations about its safety.

OK, well, I've been 'testing' it for years :) No problems until now. (If
anything, I was more concerned about journal_checksum. I thought that
had actually been implicated in corruption before now...)

> * Can you reproduce this w/o journal_async_commit?

I can try!

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-23 Thread Eric Sandeen
On 10/23/12 3:57 PM, Nix wrote:



> (I'd provide more sample errors, but this bug has been eating
> newly-written logs in /var all day, so not much has survived.)
> 
> I rebooted into 3.6.1 rescue mode and fscked everything: lots of
> orphans, block group corruption and cross-linked files. The problems did
> not recur upon booting from 3.6.1 into 3.6.1 again. It is quite clear
> that metadata changes made in 3.6.3 are not making it to disk reliably,
> thus leading to corrupted filesystems marked clean on reboot into other
> kernels: pretty much every file appended to in 3.6.3 loses some or all
> of its appended data, and newly allocated blocks often end up
> cross-linked between multiple files.
> 
> The curious thing is this doesn't affect every filesystem: for a while
> it affected only /var, and now it's affecting only /var and /home. The
> massive writes to the ext4 filesystem mounted on /usr/src seem to have
> gone off without incident: fsck reports no problems.
> 
> 
> The only unusual thing about the filesystems on this machine are that
> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
> 'nobarrier': 

I should have read more.  :(  More questions follow:

* Does the Areca have a battery backed write cache?
* Are you crashing or rebooting cleanly?
* Do you see log recovery messages in the logs for this filesystem?

> the full set of options for all my ext4 filesystems are:
> 
> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota

ok journal_async_commit is off the reservation a bit; that's really not
tested, and Jan had serious reservations about its safety.

* Can you reproduce this w/o journal_async_commit?

-Eric

> If there's anything I can do to help, I'm happy to do it, once I've
> restored my home directory from backup :(
> 
> 
> tune2fs output for one of the afflicted filesystems (after fscking):
> 
> tune2fs 1.42.2 (9-Apr-2012)
> Filesystem volume name:   home
> Last mounted on:  /home
> Filesystem UUID:  95bd22c2-253c-456f-8e36-b6cfb9ecd4ef
> Filesystem magic number:  0xEF53
> Filesystem revision #:1 (dynamic)
> Filesystem features:  has_journal ext_attr resize_inode dir_index 
> filetype needs_recovery extent flex_bg sparse_super large_file huge_file 
> uninit_bg dir_nlink extra_isize
> Filesystem flags: signed_directory_hash
> Default mount options:(none)
> Filesystem state: clean
> Errors behavior:  Continue
> Filesystem OS type:   Linux
> Inode count:  3276800
> Block count:  13107200
> Reserved block count: 655360
> Free blocks:  5134852
> Free inodes:  3174777
> First block:  0
> Block size:   4096
> Fragment size:4096
> Reserved GDT blocks:  20
> Blocks per group: 32768
> Fragments per group:  32768
> Inodes per group: 8192
> Inode blocks per group:   512
> RAID stripe width:16
> Flex block group size:64
> Filesystem created:   Tue May 26 21:29:41 2009
> Last mount time:  Tue Oct 23 21:32:07 2012
> Last write time:  Tue Oct 23 21:32:07 2012
> Mount count:  2
> Maximum mount count:  20
> Last checked: Tue Oct 23 21:22:16 2012
> Check interval:   15552000 (6 months)
> Next check after: Sun Apr 21 21:22:16 2013
> Lifetime writes:  1092 GB
> Reserved blocks uid:  0 (user root)
> Reserved blocks gid:  0 (group root)
> First inode:  11
> Inode size:   256
> Required extra isize: 28
> Desired extra isize:  28
> Journal inode:8
> First orphan inode:   1572907
> Default directory hash:   half_md4
> Directory Hash Seed:  a201983d-d8a3-460b-93ca-eb7804b62c23
> Journal backup:   inode blocks
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-23 Thread Eric Sandeen
On 10/23/12 5:19 PM, Theodore Ts'o wrote:
> On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote:
>>
>> It is now quite clear that this is a bug introduced by one or more of
>> the post-3.6.1 ext4 patches (which have all been backported at least to
>> 3.5, so the problem is probably there too).
>>
>> [   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: 
>> group 202, 1583 clusters in bitmap, 1675 in gd
>> [   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 
>> 0). There's a risk of filesystem corruption in case of system crash.
>>
> 
> I think I've found the problem.  I believe the commit at fault is commit
> 14b4ed22a6 (upstream commit eeecef0af5e):
> 
> jbd2: don't write superblock when if its empty
> 
> which first appeared in v3.6.2.
> 
> The reason why the problem happens rarely is that the effect of the
> buggy commit is that if the journal's starting block is zero, we fail
> to truncate the journal when we unmount the file system.  This can
> happen if we mount and then unmount the file system fairly quickly,
> before the log has a chance to wrap.  After the first time this has
> happened, it's not a disaster, since when we replay the journal, we'll
> just replay some extra transactions.  But if this happens twice, the
> oldest valid transaction will still not have gotten updated, but some
> of the newer transactions from the last mount session will have gotten
> written by the very latest transacitons, and when we then try to do
> the extra transaction replays, the metadata blocks can end up getting
> very scrambled indeed.

I'm stumped by this; maybe Ted can see if I'm missing something.

(and Nix, is there anything special about your fs?  Any nondefault
mkfs or mount options, external journal, inordinately large fs, or
anything like that?)

The suspect commit added this in jbd2_mark_journal_empty():

	/* Is it already empty? */
	if (sb->s_start == 0) {
		read_unlock(&journal->j_state_lock);
		return;
	}

thereby short circuiting the function.
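
(For reference, here is roughly what the rest of the function does when it
is not short circuited -- reconstructed from memory of the 3.6-era code, so
details may differ slightly. Note that the skipped tail is also where
JBD2_FLUSHED gets set, which is the other side effect Ted mentioned.)

/* Sketch of jbd2_mark_journal_empty(), from memory; not a literal quote. */
static void jbd2_mark_journal_empty(journal_t *journal)
{
	journal_superblock_t *sb = journal->j_superblock;

	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
	read_lock(&journal->j_state_lock);
	/* Is it already empty? */
	if (sb->s_start == 0) {			/* <-- the new short circuit */
		read_unlock(&journal->j_state_lock);
		return;
	}

	/* What the short circuit skips: record the current tail sequence
	 * and clear s_start so that recovery treats the log as empty... */
	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
	sb->s_start    = cpu_to_be32(0);
	read_unlock(&journal->j_state_lock);

	jbd2_write_superblock(journal, WRITE_FUA);

	/* ...and remember that the on-disk log is now clean. */
	write_lock(&journal->j_state_lock);
	journal->j_flags |= JBD2_FLUSHED;
	write_unlock(&journal->j_state_lock);
}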

But Ted's suggestion that mounting the fs, doing a little work, and
unmounting before we wrap would lead to this doesn't make sense to
me.  When I do a little work, s_start is at 1, not 0.  We start
the journal at s_first:

load_superblock()
journal->j_first = be32_to_cpu(sb->s_first);

And when we wrap the journal, we wrap back to j_first:

jbd2_journal_next_log_block():
if (journal->j_head == journal->j_last)
journal->j_head = journal->j_first;

and j_first comes from s_first, which is set at journal creation
time to be "1" for an internal journal.

So s_start == 0 sure looks special to me; so far I can only see that
we get there if we've been through jbd2_mark_journal_empty() already,
though I'm eyeballing jbd2_journal_get_log_tail() as well.

Ted's proposed patch seems harmless but so far I don't understand
what problem it fixes, and I cannot recreate getting to
jbd2_mark_journal_empty() with a dirty log and s_start == 0.
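
For completeness, the consumer of s_start/s_sequence is the recovery code;
the sketch below is paraphrased from memory rather than quoted from the
tree, and replay_from() is a hypothetical stand-in for the real
PASS_SCAN/PASS_REVOKE/PASS_REPLAY passes in fs/jbd2/recovery.c:

/* Paraphrased sketch of jbd2_journal_recover(); replay_from() is a
 * hypothetical stand-in for the real three-pass replay machinery. */
static int replay_from(journal_t *journal, unsigned long start_blk,
		       tid_t first_tid);	/* hypothetical helper */

static int journal_recover_sketch(journal_t *journal)
{
	journal_superblock_t *sb = journal->j_superblock;

	if (!sb->s_start) {
		/* s_start == 0 means the log was marked empty at the last
		 * clean unmount: nothing to replay, just adopt the next
		 * sequence number and carry on. */
		journal->j_transaction_sequence = be32_to_cpu(sb->s_sequence) + 1;
		return 0;
	}

	/*
	 * Otherwise replay starts at block s_start and accepts transactions
	 * whose tids count upward from s_sequence.  If those fields are
	 * stale because the previous clean unmount skipped rewriting them,
	 * replay can mix old transactions with log blocks that a later
	 * mount session has since overwritten -- the "metadata blocks end
	 * up getting very scrambled" failure mode from Ted's theory.
	 */
	return replay_from(journal, be32_to_cpu(sb->s_start),
			   be32_to_cpu(sb->s_sequence));
}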

-Eric

> *Sigh*.  My apologies for not catching this when I reviewed this
> patch.  I believe the following patch should fix the bug; once it's
> reviewed by other ext4 developers, I'll push this to Linus ASAP.
> 
>   - Ted
> 
> commit 26de1ba5acc39f0ab57ce1ed523cb128e4ad73a4
> Author: Theodore Ts'o 
> Date:   Tue Oct 23 18:15:22 2012 -0400
> 
> jbd2: fix a potential fs corrupting bug in jbd2_mark_journal_empty
> 
> Fix a potential file system corrupting bug which was introduced by
> commit eeecef0af5ea4efd763c9554cf2bd80fc4a0efd3: jbd2: don't write
> superblock when if its empty.
> 
> We should only skip writing the journal superblock if there is nothing
> to do --- not just when s_start is zero.
> 
> This has caused users to report file system corruptions in ext4 that
> look like this:
> 
> EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 
> clusters in bitmap, 22901 in gd
> JBD2: Spotted dirty metadata buffer (dev = sdb3, blocknr = 0). There's a 
> risk of filesystem corruption in case of system crash.
> 
> after the file system has been corrupted.
> 
> Signed-off-by: "Theodore Ts'o" 
> Cc: sta...@vger.kernel.org
> 
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 0f16edd..0064181 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1351,18 +1351,20 @@ void jbd2_journal_update_sb_log_tail(journal_t 
> *journal, tid_t tail_tid,
>  static void jbd2_mark_journal_empty(journal_t *journal)
>  {
>   journal_superblock_t *sb = journal->j_superblock;
> + __be32  new_tail_sequence;
>  
>   BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
>   read_lock(&journal->j_state_lock);
> - /* Is it already empty? */
> - if (sb->s_start == 0) {
> + new_tail_sequence = cpu_to_be32(journal->j_tail_sequence);
> + /* Nothing to do? */
> + if 

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-23 Thread Nix
On 24 Oct 2012, Theodore Ts'o told this:

> hurt, but we do want to make 100% sure that it really fixes the
> problem.

Well, yes, that would be nice. I can certainly try to verify that it
stops my filesystems getting corrupted. (And if so, I owe you a
$BEVERAGE. Though I suspect I owe you about three million of those
already for other code written in the past.)

>> The bug did really quite a lot of damage to my /home fs in only a few
>> minutes of uptime, given how few files I wrote to it. What it could have
>> done to a more conventional distro install with everything including
>> /home on one filesystem, I shudder to think.
>
> Well, the problem won't show up if the journal has wrapped.  So it
> will only show up if the system has been rebooted twice in fairly
> quick succession.  A full conventional distro install probably
> wouldn't have triggered a bug...

A full *install* from scratch, no. I was more worried about the
possibility of someone running -stable kernels on an existing distro
installation, and shutting down every night (given what's been happening
to UK electricity prices in the last few years I suspect there are quite
a lot of people doing that in the UK to save power). If they happen not
to do much on one particular day other than a bit of light distro
updating, they could perfectly well end up roasting things touched
during the distro update. Things like glibc :(

>  although someone who habitually
> reboots their laptop instead of using suspend/resume or hibernate, or
> someone who is trying to bisect the kernel looking for some other bug
> could easily trip over this --- which I guess is how you got hit by
> it.

I was first hit by it in /var before I was even trying to bisect: I was
just rebooting to unwedge NFS lockd. It's true that in less than a week
probably not all that many people have rebooted often enough to trip
over this.

I hope.

-- 
NULL && (void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

