Re: EXT4-fs error w/ external USB drive

2012-10-26 Thread Toralf Förster
On 10/24/2012 08:35 PM, Theodore Ts'o wrote:
> On Wed, Oct 24, 2012 at 07:31:57PM +0200, Toralf Förster wrote:
>>> Are you using any kind of special mount options on your usb stick?
>>>
>> nope
> 
> Thanks, we're trying to get a reliable repro of this failure, and so
> every bit of data helps...  I've cc'ed you on the other thread, and if
> you could try the second patch I sent out last night (and let me know
> when/if the WARN_ON triggers), I'd really appreciate it.
> 
> Thanks again,
> 
>   - Ted
> 
Well, here it is :

2012-10-25T21:05:28.000+02:00 n22 sudo: tfoerste : TTY=pts/2 ; PWD=/home/tfoerste/virtual/uml ; USER=root ; COMMAND=/bin/su -
2012-10-25T21:05:28.000+02:00 n22 sudo: pam_unix(sudo:session): session opened for user root by tfoerste(uid=0)
2012-10-25T21:05:28.000+02:00 n22 su[18998]: Successful su for root by root
2012-10-25T21:05:28.000+02:00 n22 su[18998]: + /dev/pts/2 root:root
2012-10-25T21:05:28.000+02:00 n22 su[18998]: pam_unix(su:session): session opened for user root by tfoerste(uid=0)
2012-10-25T21:05:44.880+02:00 n22 kernel: EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
2012-10-25T21:07:07.218+02:00 n22 kernel: JBD2: jbd2_mark_journal_empty bug workaround (79, 80)
2012-10-25T21:07:07.218+02:00 n22 kernel: ------------[ cut here ]------------
2012-10-25T21:07:07.218+02:00 n22 kernel: WARNING: at fs/jbd2/journal.c:1364 jbd2_mark_journal_empty+0xef/0x110()
2012-10-25T21:07:07.218+02:00 n22 kernel: Hardware name: 4180F65
2012-10-25T21:07:07.218+02:00 n22 kernel: Modules linked in: bluetooth cpufreq_stats loop ipt_MASQUERADE xt_owner xt_multiport ipt_REJECT xt_recent xt_tcpudp xt_mac nf_conntrack_ftp xt_state xt_limit xt_LOG iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables af_packet pppoe pppox ppp_generic slhc bridge stp llc tun msr i915 coretemp cfbfillrect cfbimgblt i2c_algo_bit cfbcopyarea fbcon bitblit snd_hda_codec_conexant softcursor font snd_hda_intel snd_hda_codec kvm_intel snd_pcm intel_agp 8250_pci intel_gtt drm_kms_helper snd_page_alloc snd_timer kvm 8250 drm thinkpad_acpi nvram uvcvideo snd serial_core agpgart fb sdhci_pci usblp videobuf2_vmalloc e1000e videobuf2_memops videobuf2_core videodev i2c_i801 tpm_tis soundcore hwmon arc4 sdhci i2c_core tpm fbdev iwldvm acpi_cpufreq mac80211 mperf ac psmouse iwlwifi battery button cfg80211 rfkill evdev mmc_core processor video tpm_bios thermal wmi xts gf128mul aesni_intel ablk_helper cryptd aes_i586 aes_generic cbc fuse nfs lockd sunrpc dm_crypt dm_mod hid_monterey hid_microsoft hid_logitech hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech hid_generic usbhid hid sr_mod cdrom sg [last unloaded: microcode]
2012-10-25T21:07:07.218+02:00 n22 kernel: Pid: 19040, comm: umount Not tainted 3.6.3 #8
2012-10-25T21:07:07.218+02:00 n22 kernel: Call Trace:
2012-10-25T21:07:07.218+02:00 n22 kernel: [] warn_slowpath_common+0x72/0xa0
2012-10-25T21:07:07.218+02:00 n22 kernel: [] ? jbd2_mark_journal_empty+0xef/0x110
2012-10-25T21:07:07.218+02:00 n22 kernel: [] ? jbd2_mark_journal_empty+0xef/0x110
2012-10-25T21:07:07.223+02:00 n22 kernel: [] warn_slowpath_null+0x22/0x30
2012-10-25T21:07:07.223+02:00 n22 kernel: [] jbd2_mark_journal_empty+0xef/0x110
2012-10-25T21:07:07.223+02:00 n22 kernel: [] jbd2_journal_destroy+0x1ce/0x1f0
2012-10-25T21:07:07.223+02:00 n22 kernel: [] ? add_wait_queue+0x50/0x50
2012-10-25T21:07:07.223+02:00 n22 kernel: [] ext4_put_super+0x4a/0x2e0
2012-10-25T21:07:07.223+02:00 n22 kernel: [] ? dispose_list+0x32/0x40
2012-10-25T21:07:07.223+02:00 n22 kernel: [] ? evict_inodes+0x8f/0xe0
2012-10-25T21:07:07.223+02:00 n22 kernel: [] generic_shutdown_super+0x51/0xd0
2012-10-25T21:07:07.223+02:00 n22 kernel: [] ? pcpu_free_area+0x145/0x190
2012-10-25T21:07:07.223+02:00 n22 kernel: [] kill_block_super+0x29/0x70
2012-10-25T21:07:07.224+02:00 n22 kernel: [] deactivate_locked_super+0x30/0x90
2012-10-25T21:07:07.224+02:00 n22 kernel: [] deactivate_super+0x47/0x60
2012-10-25T21:07:07.224+02:00 n22 kernel: [] mntput_no_expire+0xcd/0x120
2012-10-25T21:07:07.224+02:00 n22 kernel: [] sys_umount+0x6a/0x330
2012-10-25T21:07:07.224+02:00 n22 kernel: [] sys_oldumount+0x1e/0x20
2012-10-25T21:07:07.224+02:00 n22 kernel: [] sysenter_do_call+0x12/0x22
2012-10-25T21:07:07.224+02:00 n22 kernel: ---[ end trace 8e7416a7368818fe ]---
2012-10-25T21:07:07.000+02:00 n22 su[18998]: pam_unix(su:session): session closed for user root
2012-10-25T21:07:07.000+02:00 n22 sudo: pam_unix(sudo:session): session closed for user root
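
For anyone who needs to pull a warning like the one above out of a larger syslog file, a small filter may help. This is only a sketch: the function name extract_warn_blocks is hypothetical, and it assumes every warning starts with a line containing "[ cut here ]" and ends with a line containing "[ end trace ", as in the excerpt above.

```shell
# Print every kernel WARN block found on stdin, from the "cut here"
# marker through the matching "end trace" marker (both inclusive).
# extract_warn_blocks is a hypothetical helper name, not an existing tool.
extract_warn_blocks() {
    awk 'index($0, "[ cut here ]")  { inblock = 1 }
         inblock                    { print }
         index($0, "[ end trace ")  { inblock = 0 }'
}
```

It could be used as `extract_warn_blocks < /var/log/messages`. Matching on fixed substrings rather than regular expressions avoids having to escape the square brackets.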


-- 
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: EXT4-fs error w/ external USB drive

2012-10-25 Thread Eric Sandeen
On 10/25/12 1:20 PM, Theodore Ts'o wrote:
> On Thu, Oct 25, 2012 at 06:39:30PM +0200, Toralf Förster wrote:
>> After a lot of file operations (Gentoo emerges, kernel builds, git
>> pulls, ...) I suspended the system (the one with the external USB drive)
>> to disk with s2disk yesterday, woke it up today, and rebooted it.
>> I then had to repair the file system manually, because the automatic
>> fsck gave up.
> 
> OK, I'm going to send another patch series which I'd hope you could
> test to see if it reduces the rate at which this happens.
> 
>> Nevertheless there's another Linux system I have (64-bit RHEL, internal
>> drive), where EXT4 errors occurred with kernel 3.5.4-1.el6.elrepo.x86_64.
>> I attached the whole relevant section of /var/log/messages.
> 
> I don't have easy access to the RHEL kernel sources, and so I don't

Just FWIW, that's a 3rd party kernel, not something Red Hat
ships.  (see "elrepo")

-Eric



Re: EXT4-fs error w/ external USB drive

2012-10-25 Thread Theodore Ts'o
On Thu, Oct 25, 2012 at 06:39:30PM +0200, Toralf Förster wrote:
> After a lot of file operations (Gentoo emerges, kernel builds, git
> pulls, ...) I suspended the system (the one with the external USB drive)
> to disk with s2disk yesterday, woke it up today, and rebooted it.
> I then had to repair the file system manually, because the automatic
> fsck gave up.

OK, I'm going to send another patch series which I'd hope you could
test to see if it reduces the rate at which this happens.

> Nevertheless there's another Linux system I have (64-bit RHEL, internal
> drive), where EXT4 errors occurred with kernel 3.5.4-1.el6.elrepo.x86_64.
> I attached the whole relevant section of /var/log/messages.

I don't have easy access to the RHEL kernel sources, and so I don't
know which patches were applied.  Specifically, I'd really like to
know if the commit represented by 14b4ed22a6 is in RHEL
3.5.4-1.el6.elrepo.  Also, I'd like to know which line number was
reflected here, which was the first EXT4-fs error:

> Sep 26 09:26:54 x kernel: EXT4-fs error (device dm-1) in ext4_new_inode:938: IO failure

This was from fs/ext4/ialloc.c line 938, and there are two
ext4_std_error() calls that this could represent, which is why having the
exact kernel sources from this RHEL kernel would be useful.  (I'd also
suggest opening a RHEL support ticket if you have a support contract,
since that way Red Hat can track this issue, and that way Eric can
count the work he's been doing on this fire drill as supporting a
customer.  :-)

What's a bit unfortunate is that there were no other error messages
before this line.  So we can't know for sure what caused or returned
the -EIO error code.  I *suspect* it was this, which would
indicate a corrupted inode bitmap:

	if (insert_inode_locked(inode) < 0) {
		/*
		 * Likely a bitmap corruption causing inode to be allocated
		 * twice.
		 */
		err = -EIO;
		goto fail;
	}

Do you know if this external disk could have suffered from a cable
pull, or a flaky cable, or some kind of unclean shutdown/power failure
before it rebooted?  That would be an interesting data point.  

For the future, we need to add some better error reporting for
failures such as this.  In addition, I have a recent change we made at
work that I should get upstream which avoids allocating from a block
group once we notice a corruption (currently just for the block
allocations, but I think we should do this for inode allocations as
well), to minimize the chances of lost data once we notice that the
block/inode allocation bitmap can't be trusted.  This avoids data loss
in the case where users are using the default errors=continue instead
of errors=panic or errors=remount-ro.

Speaking of which, for your production RHEL server, you might want to
seriously consider errors=panic for any critical file system volume.
This allows the file system to get corrected via e2fsck, and prevents
the server from stumbling along, possibly causing more data loss due
to a fs corruption.
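
For reference, the error behaviour discussed above can be inspected and changed with the standard e2fsprogs tools. The following is a minimal sketch, not a recommendation for any particular system: errors_behavior is a hypothetical helper name, /dev/sdXN and /data are placeholders, and the commands that touch a real device are left as comments on purpose.

```shell
# Print the configured error behaviour from `tune2fs -l` output on stdin.
# errors_behavior is a hypothetical helper name; /dev/sdXN is a placeholder.
errors_behavior() {
    sed -n 's/^Errors behavior:[[:space:]]*//p'
}

# Typical use (as root):
#   tune2fs -l /dev/sdXN | errors_behavior
#
# Change the default stored in the superblock
# (accepted values: continue, remount-ro, panic):
#   tune2fs -e panic /dev/sdXN
#
# Or set it per mount in the options field of /etc/fstab:
#   /dev/sdXN  /data  ext4  defaults,errors=panic  0 2
```

The superblock setting applies whenever no errors= mount option is given; the mount option takes precedence for that mount.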

Regards,

- Ted


Re: EXT4-fs error w/ external USB drive

2012-10-24 Thread Theodore Ts'o
On Wed, Oct 24, 2012 at 07:31:57PM +0200, Toralf Förster wrote:
> > Are you using any kind of special mount options on your usb stick?
> >
> nope

Thanks, we're trying to get a reliable repro of this failure, and so
every bit of data helps...  I've cc'ed you on the other thread, and if
you could try the second patch I sent out last night (and let me know
when/if the WARN_ON triggers), I'd really appreciate it.

Thanks again,

  - Ted


Re: EXT4-fs error w/ external USB drive

2012-10-24 Thread Toralf Förster
On 10/24/2012 03:11 AM, Theodore Ts'o wrote:
> Toralf,
> 
> Are you using any kind of special mount options on your usb stick?
> 
> Thanks,
> 
>   - Ted
nope

tfoerste@n22 ~/devel/linux $ grep ext4 /etc/fstab
/dev/sdb3   /   ext4   noatime   0 1


-- 
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3


Re: EXT4-fs error w/ external USB drive

2012-10-23 Thread Theodore Ts'o
Toralf,

Are you using any kind of special mount options on your usb stick?

Thanks,

- Ted


Re: EXT4-fs error w/ external USB drive

2012-10-22 Thread Toralf Förster
On 10/19/2012 11:07 PM, Theodore Ts'o wrote:
> On Mon, Oct 15, 2012 at 07:46:02PM +0200, Toralf Förster wrote:
>> Even with current stable kernel 3.6.2 I sometimes get those syslog messages :
>>
>>
>> 2012-10-15T19:37:58.401+02:00 n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd
> 
> Have you run e2fsck to clean up the file system corruption?  If you
> have, do you continually get these errors afterwards?

Well, I got it yesterday too :

n22 ~ # zgrep ext4_mb_generate_buddy /var/log/messages-201210*
/var/log/messages-20121021.gz:2012-10-15T19:05:39.189+02:00 n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 774, 27157 clusters in bitmap, 27052 in gd
/var/log/messages-20121021.gz:2012-10-15T19:37:58.401+02:00 n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd
/var/log/messages-20121021.gz:2012-10-15T19:56:05.301+02:00 n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 1233, 11981 clusters in bitmap, 11974 in gd
/var/log/messages-20121021.gz:2012-10-15T19:56:18.601+02:00 n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 484, 28817 clusters in bitmap, 28101 in gd

I rebooted the system and forced a run of fsck after these lines too.
I'll check periodically whether it happens again.
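
To compare such messages at a glance, the per-group mismatch between the bitmap count and the group descriptor count can be computed with a short filter. This is a sketch under assumptions: summarize_buddy_errors is a hypothetical helper name, and it expects each message on a single unwrapped line in the "group G, B clusters in bitmap, D in gd" layout shown above.

```shell
# Summarize ext4_mb_generate_buddy mismatches read from stdin.
# Assumes one syslog message per line, in the layout:
#   ... ext4_mb_generate_buddy:NNN: group G, B clusters in bitmap, D in gd
# summarize_buddy_errors is a hypothetical helper name.
summarize_buddy_errors() {
    # Keep only matching lines and reduce them to "G B D" ...
    sed -n 's/.*ext4_mb_generate_buddy:[0-9]*: group \([0-9]*\), \([0-9]*\) clusters in bitmap, \([0-9]*\) in gd.*/\1 \2 \3/p' |
    # ... then print the counts and their difference (bitmap minus gd).
    awk '{ printf "group %s: bitmap=%s gd=%s diff=%d\n", $1, $2, $3, $2 - $3 }'
}
```

It could be fed with `zcat /var/log/messages-20121021.gz | summarize_buddy_errors`; lines that do not match the pattern are silently skipped.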
 
> You say this is an external USB disk; is there any possibility of the
> disk getting unmounted uncleanly due to the cable getting pulled out
> while the disk is still mounted, and then the disk getting remounted
> w/o having e2fsck run on the disk?

For the first occurrence, probably yes (or rather: I don't know), but
yesterday definitely not.



-- 
MfG/Sincerely
Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3


Re: EXT4-fs error w/ external USB drive

2012-10-19 Thread Theodore Ts'o
On Mon, Oct 15, 2012 at 07:46:02PM +0200, Toralf Förster wrote:
> Even with current stable kernel 3.6.2 I sometimes get those syslog messages :
> 
> 
> 2012-10-15T19:37:58.401+02:00 n22 kernel: EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 clusters in bitmap, 22901 in gd

Have you run e2fsck to clean up the file system corruption?  If you
have, do you continually get these errors afterwards?

You say this is an external USB disk; is there any possibility of the
disk getting unmounted uncleanly due to the cable getting pulled out
while the disk is still mounted, and then the disk getting remounted
w/o having e2fsck run on the disk?

- Ted