Re: 2.6.24-rc3-mm1: I/O error, system hangs
Le 25.11.2007 21:39, Laurent Riffard a écrit : > Le 25.11.2007 08:37, James Bottomley a écrit : >> On Sat, 2007-11-24 at 23:59 +0100, Laurent Riffard wrote: >>> Le 24.11.2007 14:26, James Bottomley a écrit : OK, could you post dmesgs again, please. I actually tested this >>> with an aic79xx card, and for me it does cause Domain Validation to succeed again. >>> James, >>> >>> Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates >>> the >>> BLOCK and QUIESCE states >>> correctly" (http://lkml.org/lkml/2007/11/24/8). >>> [...] >>> [ 25.521256] scsi0 : pata_via >>> [ 25.521711] scsi1 : pata_via >>> [ 25.524089] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xb800 irq >>> 14 >>> [ 25.524176] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xb808 irq >>> 15 >>> [ 25.683141] ata1.00: ATA-5: ST340016A, 3.75, max UDMA/100 >>> [ 25.683208] ata1.00: 78165360 sectors, multi 16: LBA >>> [ 25.683475] ata1.01: ATA-7: Maxtor 6Y080L0, YAR41BW0, max UDMA/133 >>> [ 25.684116] ata1.01: 160086528 sectors, multi 16: LBA >>> [ 25.691127] ata1.00: configured for UDMA/100 >>> [ 25.699142] ata1.01: configured for UDMA/100 >>> [ 26.170860] ata2.00: ATAPI: HL-DT-ST DVDRAM GSA-4165B, DL05, max UDMA/33 >>> [ 26.171562] ata2.01: ATAPI: CD-950E/AKU, A4Q, max MWDMA2, CDB intr >>> [ 26.330839] ata2.00: configured for UDMA/33 >>> [ 26.490828] ata2.01: configured for MWDMA2 >>> [ 26.503014] scsi 0:0:0:0: Direct-Access ATA ST340016A 3.75 PQ: >>> 0 ANSI: 5 >>> [ 26.504670] scsi 0:0:1:0: Direct-Access ATA Maxtor 6Y080L0 YAR4 >>> PQ: 0 ANSI: 5 >>> [ 26.509842] scsi 1:0:0:0: CD-ROMHL-DT-ST DVDRAM GSA-4165B >>> DL05 PQ: 0 ANSI: 5 >>> [ 26.511673] scsi 1:0:1:0: CD-ROME-IDECD-950E/AKU A4Q >>> PQ: 0 ANSI: 5 >> [...] >>> [ 60.216113] sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT >>> driverbyte=DRIVER_OK,SUGGEST_OK >>> [ 60.216124] end_request: I/O error, dev sda, sector 16460 >> I think this one's quite easy: PATA devices in libata are queue depth 1 >> (since they don't do NCQ). 
Thus, they're peculiarly sensitive to the >> bug where we fail over queue depth requests. >> >> On the other hand, I don't see how a filesystem request is getting >> REQ_FAILFAST ... unless there's a bio or readahead issue involved. >> Anyway, could you try this patch: >> >> http://marc.info/?l=linux-scsi=119592627425498 >> >> Which should fix the queue depth issue, and see if the errors go away? > > No, this one doesn't help... still happens with 2.6.24-rc3-mm2... -- laurent - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.24-rc3-mm1: I/O error, system hangs
On Sat, Nov 24, 2007 at 07:44:13PM +0200, James Bottomley wrote: > Probing intermittent failures in Domain Validation, even with the fixes > applied leads me to the conclusion that there are further problems with > this commit: > > commit fc5eb4facedbd6d7117905e775cee1975f894e79 > Author: Hannes Reinecke <[EMAIL PROTECTED]> > Date: Tue Nov 6 09:23:40 2007 +0100 > > [SCSI] Do not requeue requests if REQ_FAILFAST is set > > The essence of the problems is that you're causing REQ_FAILFAST to > terminate commands with error on requeuing conditions, some of which are > relatively common on most SCSI devices. While this may be the correct > behaviour for multi-path, it's certainly wrong for the previously > understood meaning of REQ_FAILFAST, which was don't retry on error, > which is why domain validation and other applications use it to control > error handling, but don't expect to get failures for a simple requeue > are now spitting errors. > > I honestly can't see that, even for the multi-path case, returning an > error when we're over queue depth is the correct thing to do (it may not > matter to something like a symmetrix, but an array that has a non-zero > cost associated with a path change, like a CPQ HSV or the AVT > controllers, will show fairly large slow downs if you do this). Even if > this is the desired behaviour (and I think that's a policy issue), > DID_NO_CONNECT is almost certainly the wrong error to be sending back. > > This patch fixes up domain validation to work again correctly, however, > I really think it's just a bandaid. Do you want to rethink the above > commit? > Given the amounted error, yes, I'll have to. But we still face the initial problem that requeued requests will be stuck in the queue forever (ie until the timeout catches it), causing failover to be painfully slow. Anyway, I'll think it over. Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage [EMAIL PROTECTED] +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 
5, 90409 Nürnberg GF: Markus Rex, HRB 16746 (AG Nürnberg)
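The disagreement in this message can be illustrated with a toy model. The sketch below is not kernel code: the `Request` class, the `dispatch` function and the `fail_on_requeue` switch are hypothetical names invented for illustration, and Python is used only for brevity. It assumes a device that accepts at most `queue_depth` commands at once and contrasts the older reading of REQ_FAILFAST (do not retry on error, but still wait when the device is busy) with the behaviour criticised above (complete an over-depth failfast request with an error).

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    failfast: bool = False  # stands in for REQ_FAILFAST (illustrative only)


def dispatch(requests, queue_depth, fail_on_requeue):
    """Toy dispatcher: return (issued, requeued, failed) lists of request names.

    fail_on_requeue=False models the older meaning of failfast:
    an over-depth request simply waits for room on the device.
    fail_on_requeue=True models the behaviour discussed in the thread:
    an over-depth failfast request is completed with an error instead.
    """
    issued, requeued, failed = [], [], []
    for req in requests:
        if len(issued) < queue_depth:
            issued.append(req.name)
        elif fail_on_requeue and req.failfast:
            failed.append(req.name)    # would surface as an I/O error
        else:
            requeued.append(req.name)  # retried once the device has room
    return issued, requeued, failed


if __name__ == "__main__":
    reqs = [Request("read-a", failfast=True), Request("read-b", failfast=True)]
    # Queue depth 1, as for the PATA devices mentioned elsewhere in the thread.
    print(dispatch(reqs, queue_depth=1, fail_on_requeue=False))
    print(dispatch(reqs, queue_depth=1, fail_on_requeue=True))
```

With a queue depth of 1, the second policy fails every failfast request that arrives while one command is outstanding, which is one way to read why such devices would be hit hardest.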
Re: 2.6.24-rc3-mm1: I/O error, system hangs
On 25.11.2007 08:37, James Bottomley wrote:
> On Sat, 2007-11-24 at 23:59 +0100, Laurent Riffard wrote:
>> On 24.11.2007 14:26, James Bottomley wrote:
>>> OK, could you post dmesgs again, please. I actually tested this with an
>>> aic79xx card, and for me it does cause Domain Validation to succeed
>>> again.
>> James,
>>
>> Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates the
>> BLOCK and QUIESCE states correctly" (http://lkml.org/lkml/2007/11/24/8).
>>
>> How to reproduce:
>> - boot
>> - switch to a text console
>> - capture dmesg in a file, sync, etc. There are 3 I/O errors, but the
>> system does work.
>> - switch to the X console and log in to the GNOME desktop: the system
>> partially hangs.
>> - switch back to a text console: dmesg(1) still works and shows some
>> additional I/O errors. At this point, any disk access hangs the system
>> completely.
>>
>> Additional data:
>> - the I/O errors always happen on the same blocks.
>>
>> plain text document attachment (dmesg-2.6.24-rc3-mm1-patched)
> [...]
>> [ 25.521256] scsi0 : pata_via >> [ 25.521711] scsi1 : pata_via >> [ 25.524089] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xb800 irq >> 14 >> [ 25.524176] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xb808 irq >> 15 >> [ 25.683141] ata1.00: ATA-5: ST340016A, 3.75, max UDMA/100 >> [ 25.683208] ata1.00: 78165360 sectors, multi 16: LBA >> [ 25.683475] ata1.01: ATA-7: Maxtor 6Y080L0, YAR41BW0, max UDMA/133 >> [ 25.684116] ata1.01: 160086528 sectors, multi 16: LBA >> [ 25.691127] ata1.00: configured for UDMA/100 >> [ 25.699142] ata1.01: configured for UDMA/100 >> [ 26.170860] ata2.00: ATAPI: HL-DT-ST DVDRAM GSA-4165B, DL05, max UDMA/33 >> [ 26.171562] ata2.01: ATAPI: CD-950E/AKU, A4Q, max MWDMA2, CDB intr >> [ 26.330839] ata2.00: configured for UDMA/33 >> [ 26.490828] ata2.01: configured for MWDMA2 >> [ 26.503014] scsi 0:0:0:0: Direct-Access ATA ST340016A 3.75 PQ: 0 >> ANSI: 5 >> [ 26.504670] scsi 0:0:1:0: Direct-Access ATA Maxtor 6Y080L0 YAR4 >> PQ: 0 ANSI: 5 >> [ 26.509842] scsi 1:0:0:0: CD-ROMHL-DT-ST DVDRAM GSA-4165B >> DL05 PQ: 0 ANSI: 5 >> [ 26.511673] scsi 1:0:1:0: CD-ROME-IDECD-950E/AKU A4Q PQ: >> 0 ANSI: 5 > [...] >> [ 60.216113] sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT >> driverbyte=DRIVER_OK,SUGGEST_OK >> [ 60.216124] end_request: I/O error, dev sda, sector 16460 > > I think this one's quite easy: PATA devices in libata are queue depth 1 > (since they don't do NCQ). Thus, they're peculiarly sensitive to the > bug where we fail over queue depth requests. > > On the other hand, I don't see how a filesystem request is getting > REQ_FAILFAST ... unless there's a bio or readahead issue involved. > Anyway, could you try this patch: > > http://marc.info/?l=linux-scsi=119592627425498 > > Which should fix the queue depth issue, and see if the errors go away? No, this one doesn't help... 
-- laurent
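The observation that the I/O errors always hit the same blocks can be checked mechanically against a saved dmesg. The following is a minimal sketch, assuming the log uses the `end_request: I/O error, dev ..., sector ...` format quoted in this thread; the function name `count_io_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "end_request: I/O error, dev sda, sector 16460",
# with or without a leading "[ 60.216124]" timestamp.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def count_io_errors(dmesg_text):
    """Return a Counter mapping (device, sector) to the number of errors seen."""
    hits = Counter()
    for line in dmesg_text.splitlines():
        match = IO_ERROR.search(line)
        if match:
            hits[(match.group(1), int(match.group(2)))] += 1
    return hits


# Sample lines copied from the reports in this thread.
SAMPLE = """\
[ 60.216113] sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
[ 60.216124] end_request: I/O error, dev sda, sector 16460
end_request: I/O error, dev sdb, sector 19632
end_request: I/O error, dev sdb, sector 40037363
"""

if __name__ == "__main__":
    for (dev, sector), n in sorted(count_io_errors(SAMPLE).items()):
        print(f"{dev} sector {sector}: {n} error(s)")
```

Running it over the dmesg from several boots and comparing the resulting keys would show whether the failing sectors really repeat.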
Re: 2.6.24-rc3-mm1: I/O error, system hangs
On Sat, 2007-11-24 at 23:59 +0100, Laurent Riffard wrote: > Le 24.11.2007 14:26, James Bottomley a écrit : > > OK, could you post dmesgs again, please. I actually tested this > with an > > aic79xx card, and for me it does cause Domain Validation to succeed > > again. > > James, > > Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates > the > BLOCK and QUIESCE states > correctly" (http://lkml.org/lkml/2007/11/24/8). > > How to reproduce : > - boot > - switch to a text console > - capture dmesg in a file, sync, etc. There are 3 I/O errors, but the > system does work. > - switch to X console, log in the Gnome Desktop, the system partially > hangs. > - switch back to a text console: dmesg(1) still works, it shows some > additonal I/O errors. At this point, any disk access makes the > system > completely hung. > > Additionnal data: > - the I/O errors always happen on the same blocks. > > plain text document attachment (dmesg-2.6.24-rc3-mm1-patched) [...] > [ 25.521256] scsi0 : pata_via > [ 25.521711] scsi1 : pata_via > [ 25.524089] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma > 0xb800 irq 14 > [ 25.524176] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma > 0xb808 irq 15 > [ 25.683141] ata1.00: ATA-5: ST340016A, 3.75, max UDMA/100 > [ 25.683208] ata1.00: 78165360 sectors, multi 16: LBA > [ 25.683475] ata1.01: ATA-7: Maxtor 6Y080L0, YAR41BW0, max UDMA/133 > [ 25.684116] ata1.01: 160086528 sectors, multi 16: LBA > [ 25.691127] ata1.00: configured for UDMA/100 > [ 25.699142] ata1.01: configured for UDMA/100 > [ 26.170860] ata2.00: ATAPI: HL-DT-ST DVDRAM GSA-4165B, DL05, max > UDMA/33 > [ 26.171562] ata2.01: ATAPI: CD-950E/AKU, A4Q, max MWDMA2, CDB intr > [ 26.330839] ata2.00: configured for UDMA/33 > [ 26.490828] ata2.01: configured for MWDMA2 > [ 26.503014] scsi 0:0:0:0: Direct-Access ATA ST340016A > 3.75 PQ: 0 ANSI: 5 > [ 26.504670] scsi 0:0:1:0: Direct-Access ATA Maxtor 6Y080L0 > YAR4 PQ: 0 ANSI: 5 > [ 26.509842] scsi 1:0:0:0: CD-ROMHL-DT-ST DVDRAM > 
GSA-4165B DL05 PQ: 0 ANSI: 5
> [ 26.511673] scsi 1:0:1:0: CD-ROME-IDECD-950E/AKU A4Q PQ: 0 ANSI: 5
[...]
> [ 60.216113] sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> [ 60.216124] end_request: I/O error, dev sda, sector 16460

I think this one's quite easy: PATA devices in libata are queue depth 1 (since they don't do NCQ). Thus, they're peculiarly sensitive to the bug where we fail over queue depth requests.

On the other hand, I don't see how a filesystem request is getting REQ_FAILFAST ... unless there's a bio or readahead issue involved. Anyway, could you try this patch:

http://marc.info/?l=linux-scsi&m=119592627425498

Which should fix the queue depth issue, and see if the errors go away?

Thanks,

James
Re: 2.6.24-rc3-mm1: I/O error, system hangs
On 24.11.2007 14:26, James Bottomley wrote:
> On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
>> On 24.11.2007 07:42, James Bottomley wrote:
>>> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
>>>> On 23.11.2007 12:38, Hannes Reinecke wrote:
>>>>> [snip]
>>>>>>> I can confirm: reverting commit
>>>>>>> 8655a546c83fc43f0a73416bbd126d02de7ad6c0 does fix the problem.
>>>>>> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an
>>>>>> error where I shouldn't. Checking ...
>>>>>>
>>>>> Ok, found it. We are blocking even special commands (ie requests with
>>>>> PREEMPT not set) when FAILFAST is set. Which is clearly wrong. The
>>>>> attached patch fixes this.
>>>> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with
>>>> I/O errors.
>>> I think the problem is the way we treat BLOCKED and QUIESCED (the latter
>>> is the state that the domain validation uses and which we cannot kill
>>> fastfail on). It's definitely wrong to kill fastfail requests when the
>>> state is QUIESCE.
>>>
>>> This patch (which is applied on top of Hannes original) separates the
>>> BLOCK and QUIESCE states correctly ... does this fix the problem?
>>
>> No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
>
> OK, could you post dmesgs again, please. I actually tested this with an
> aic79xx card, and for me it does cause Domain Validation to succeed
> again.

James,

Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates the
BLOCK and QUIESCE states correctly" (http://lkml.org/lkml/2007/11/24/8).

How to reproduce:
- boot
- switch to a text console
- capture dmesg in a file, sync, etc. There are 3 I/O errors, but the
system does work.
- switch to the X console and log in to the GNOME desktop: the system
partially hangs.
- switch back to a text console: dmesg(1) still works and shows some
additional I/O errors. At this point, any disk access hangs the system
completely.

Additional data:
- the I/O errors always happen on the same blocks.
-- laurent [0.00] Linux version 2.6.24-rc3-mm1 ([EMAIL PROTECTED]) (gcc version 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)) #122 PREEMPT Fri Nov 23 18:47:58 CET 2007 [0.00] BIOS-provided physical RAM map: [0.00] BIOS-e820: - 0009fc00 (usable) [0.00] BIOS-e820: 0009fc00 - 000a (reserved) [0.00] BIOS-e820: 000f - 0010 (reserved) [0.00] BIOS-e820: 0010 - 1ffec000 (usable) [0.00] BIOS-e820: 1ffec000 - 1ffef000 (ACPI data) [0.00] BIOS-e820: 1ffef000 - 1000 (reserved) [0.00] BIOS-e820: 1000 - 2000 (ACPI NVS) [0.00] BIOS-e820: - 0001 (reserved) [0.00] 511MB LOWMEM available. [0.00] Entering add_active_range(0, 0, 131052) 0 entries of 256 used [0.00] sizeof(struct page) = 32 [0.00] Zone PFN ranges: [0.00] DMA 0 -> 4096 [0.00] Normal 4096 -> 131052 [0.00] Movable zone start PFN for each node [0.00] early_node_map[1] active PFN ranges [0.00] 0:0 -> 131052 [0.00] On node 0 totalpages: 131052 [0.00] Node 0 memmap at 0xC100 size 4194304 first pfn 0xC100 [0.00] DMA zone: 32 pages used for memmap [0.00] DMA zone: 0 pages reserved [0.00] DMA zone: 4064 pages, LIFO batch:0 [0.00] Normal zone: 991 pages used for memmap [0.00] Normal zone: 125965 pages, LIFO batch:31 [0.00] Movable zone: 0 pages used for memmap [0.00] DMI 2.3 present. [0.00] ACPI: RSDP 000F6A80, 0014 (r0 ASUS ) [0.00] ACPI: RSDT 1FFEC000, 002C (r1 ASUS A7V133-C 30303031 MSFT 31313031) [0.00] ACPI: FACP 1FFEC080, 0074 (r1 ASUS A7V133-C 30303031 MSFT 31313031) [0.00] ACPI: DSDT 1FFEC100, 2CE1 (r1 ASUS A7V133-C 1000 MSFT 10B) [0.00] ACPI: FACS 1000, 0040 [0.00] ACPI: BOOT 1FFEC040, 0028 (r1 ASUS A7V133-C 30303031 MSFT 31313031) [0.00] ACPI: PM-Timer IO Port: 0xe408 [0.00] Allocating PCI resources starting at 3000 (gap: 2000:dfff) [0.00] swsusp: Registered nosave memory region: 0009f000 - 000a [0.00] swsusp: Registered nosave memory region: 000a - 000f [0.00] swsusp: Registered nosave memory region: 000f - 0010 [0.00] Built 1 zonelists in Zone order, mobility grouping on. 
Total pages: 130029 [0.00] Kernel command line: root=/dev/mapper/vglinux1-lv_ubuntu2 ro locale=fr_FR video=radeonfb:[EMAIL PROTECTED] resume=/dev/mapper/vglinux1-lvswap [0.00] Local APIC disabled by BIOS -- you can enable it with "lapic" [0.00] mapped APIC to b000
Re: 2.6.24-rc3-mm1: I/O error, system hangs
Gabriel C wrote: > James Bottomley wrote: >> On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote: >>> James Bottomley wrote: On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote: > Le 24.11.2007 07:42, James Bottomley a écrit : >> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote: >>> Le 23.11.2007 12:38, Hannes Reinecke a écrit : Hannes Reinecke wrote: > Laurent Riffard wrote: >> Le 21.11.2007 23:41, Andrew Morton a écrit : >>> On Wed, 21 Nov 2007 22:45:22 +0100 >>> Laurent Riffard <[EMAIL PROTECTED]> wrote: >>> Le 21.11.2007 05:45, Andrew Morton a écrit : > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/ Hello, My system hangs shortly after I logged in Gnome desktop. SysRq-W shows that a bunch of task are blocked in "D" state, they seem to wait for some I/O completion. I can try to hand-copy some data if requested. I found these messages in dmesg: ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 EXT3-fs: mounted filesystem with ordered data mode. sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sda, sector 16460 ReiserFS: sda7: found reiserfs format "3.6" with standard journal ReiserFS: sda7: using ordered data mode -- ReiserFS: sda7: Using r5 hash to sort names sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sdb, sector 19632 sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sdb, sector 40037363 Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 extents:1 across:1048568k lp0: using parport0 (interrupt-driven). These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible. 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine. Maybe something is broken in pata_via driver ? 
>>> Could be - >>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch >>> and >>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch >>> touch pata_via.c. >> None of the above... >> >> I did a bisection, it spotted git-scsi-misc.patch. >> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works >> fine. >> >> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do >> not >> requeue requests if REQ_FAILFAST is set" is the real culprit. The >> other >> commits are touching documentation or drivers I don't use. I'll try >> to revert only this one this evening. >>> I can confirm : reverting commit >>> 8655a546c83fc43f0a73416bbd126d02de7ad6c0 >>> does fix the problem. >>> > Hmm. Weird. I'll have a look into it. Apparently I'll be returning an > error where > I shouldn't. Checking ... > Ok, found it. We are blocking even special commands (ie requests with PREEMPT not set) when FAILFAST is set. Which is clearly wrong. The attached patch fixes this. >>> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with >>> I/O errors. >> I think the problem is the way we treat BLOCKED and QUIESCED (the latter >> is the state that the domain validation uses and which we cannot kill >> fastfail on). It's definitely wrong to kill fastfail requests when the >> state is QUIESCE. >> >> This patch (which is applied on top of Hannes original) separates the >> BLOCK and QUIESCE states correctly ... does this fix the problem? > No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems) OK, could you post dmesgs again, please. I actually tested this with an aic79xx card, and for me it does cause Domain Validation to succeed again. >>> Are the patches indeed to fix that problem as well ? >>> >>> http://lkml.org/lkml/2007/11/23/5 >> That dmesg is from an unknown SCSI card exhibiting Domain Validation >> problems, so it's a reasonable probability, yes ... 
but you'll need the >> additional hack I just did to prevent further intermittent failures. > > My controller is: > > 03:0e.0 SCSI storage controller [0100]: Adaptec AIC-7892P U160/m [9005:008f] > (rev 02) > > I'll try the patches in a bit. With your patches my problem(s) are solved. Domain Validation works again. ... [ 32.179521] scsi
Re: 2.6.24-rc3-mm1: I/O error, system hangs
James Bottomley wrote: > On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote: >> James Bottomley wrote: >>> On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote: Le 24.11.2007 07:42, James Bottomley a écrit : > On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote: >> Le 23.11.2007 12:38, Hannes Reinecke a écrit : >>> Hannes Reinecke wrote: Laurent Riffard wrote: > Le 21.11.2007 23:41, Andrew Morton a écrit : >> On Wed, 21 Nov 2007 22:45:22 +0100 >> Laurent Riffard <[EMAIL PROTECTED]> wrote: >> >>> Le 21.11.2007 05:45, Andrew Morton a écrit : ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/ >>> Hello, >>> >>> My system hangs shortly after I logged in Gnome desktop. SysRq-W >>> shows >>> that a bunch of task are blocked in "D" state, they seem to wait for >>> some I/O completion. I can try to hand-copy some data if requested. >>> >>> I found these messages in dmesg: >>> >>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 >>> EXT3-fs: mounted filesystem with ordered data mode. >>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT >>> driverbyte=DRIVER_OK,SUGGEST_OK >>> end_request: I/O error, dev sda, sector 16460 >>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal >>> ReiserFS: sda7: using ordered data mode >>> -- >>> ReiserFS: sda7: Using r5 hash to sort names >>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT >>> driverbyte=DRIVER_OK,SUGGEST_OK >>> end_request: I/O error, dev sdb, sector 19632 >>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT >>> driverbyte=DRIVER_OK,SUGGEST_OK >>> end_request: I/O error, dev sdb, sector 40037363 >>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 >>> extents:1 across:1048568k >>> lp0: using parport0 (interrupt-driven). >>> >>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% >>> reproducible. >>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine. >>> >>> Maybe something is broken in pata_via driver ? 
>>> >> Could be - >> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch >> and >> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch >> touch pata_via.c. > None of the above... > > I did a bisection, it spotted git-scsi-misc.patch. > I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works > fine. > > I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do > not > requeue requests if REQ_FAILFAST is set" is the real culprit. The > other > commits are touching documentation or drivers I don't use. I'll try > to revert only this one this evening. >> I can confirm : reverting commit >> 8655a546c83fc43f0a73416bbd126d02de7ad6c0 >> does fix the problem. >> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an error where I shouldn't. Checking ... >>> Ok, found it. We are blocking even special commands (ie requests with >>> PREEMPT not set) >>> when FAILFAST is set. Which is clearly wrong. The attached patch fixes >>> this. >> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O >> errors. > I think the problem is the way we treat BLOCKED and QUIESCED (the latter > is the state that the domain validation uses and which we cannot kill > fastfail on). It's definitely wrong to kill fastfail requests when the > state is QUIESCE. > > This patch (which is applied on top of Hannes original) separates the > BLOCK and QUIESCE states correctly ... does this fix the problem? No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems) >>> OK, could you post dmesgs again, please. I actually tested this with an >>> aic79xx card, and for me it does cause Domain Validation to succeed >>> again. >>> >> Are the patches indeed to fix that problem as well ? >> >> http://lkml.org/lkml/2007/11/23/5 > > That dmesg is from an unknown SCSI card exhibiting Domain Validation > problems, so it's a reasonable probability, yes ... 
but you'll need the > additional hack I just did to prevent further intermittent failures. My controller is: 03:0e.0 SCSI storage controller [0100]: Adaptec AIC-7892P U160/m [9005:008f] (rev 02) I'll try the patches in a bit. > > James > Gabriel
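The BLOCK versus QUIESCE distinction quoted in this message can be sketched as a small decision function. This is only one reading of the discussion, not the patch itself: the names `DeviceState`, `Verdict` and `prep_check` are hypothetical, Python stands in for the kernel's C, and the choice to kill failfast requests in the BLOCK state is an assumption drawn from the multi-path motivation mentioned in the thread.

```python
from enum import Enum


class DeviceState(Enum):
    RUNNING = "running"
    BLOCK = "block"
    QUIESCE = "quiesce"


class Verdict(Enum):
    ISSUE = "issue"  # send the request to the device
    DEFER = "defer"  # leave it queued and try again later
    KILL = "kill"    # complete it immediately with an error


def prep_check(state, preempt=False, failfast=False):
    """Decide what to do with a request, given the (illustrative) device state."""
    if state is DeviceState.RUNNING:
        return Verdict.ISSUE
    if state is DeviceState.QUIESCE:
        # Special (preempt) commands, such as Domain Validation probes, pass.
        # Everything else waits; failfast requests are never killed here.
        return Verdict.ISSUE if preempt else Verdict.DEFER
    # BLOCK (assumption): failfast requests are killed so that a multi-path
    # layer could fail over quickly; all other requests are deferred.
    return Verdict.KILL if failfast else Verdict.DEFER


if __name__ == "__main__":
    # A Domain Validation style request: preempt and failfast both set.
    print(prep_check(DeviceState.QUIESCE, preempt=True, failfast=True))
    # An ordinary failfast request while the device is quiesced.
    print(prep_check(DeviceState.QUIESCE, failfast=True))
```

Treating the two states identically would kill the Domain Validation request in the first case, which matches the failure described for SCSI cards in this thread.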
Re: 2.6.24-rc3-mm1: I/O error, system hangs
On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote: > James Bottomley wrote: > > On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote: > >> Le 24.11.2007 07:42, James Bottomley a écrit : > >>> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote: > Le 23.11.2007 12:38, Hannes Reinecke a écrit : > > Hannes Reinecke wrote: > >> Laurent Riffard wrote: > >>> Le 21.11.2007 23:41, Andrew Morton a écrit : > On Wed, 21 Nov 2007 22:45:22 +0100 > Laurent Riffard <[EMAIL PROTECTED]> wrote: > > > Le 21.11.2007 05:45, Andrew Morton a écrit : > >> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/ > > Hello, > > > > My system hangs shortly after I logged in Gnome desktop. SysRq-W > > shows > > that a bunch of task are blocked in "D" state, they seem to wait for > > some I/O completion. I can try to hand-copy some data if requested. > > > > I found these messages in dmesg: > > > > ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 > > EXT3-fs: mounted filesystem with ordered data mode. > > sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT > > driverbyte=DRIVER_OK,SUGGEST_OK > > end_request: I/O error, dev sda, sector 16460 > > ReiserFS: sda7: found reiserfs format "3.6" with standard journal > > ReiserFS: sda7: using ordered data mode > > -- > > ReiserFS: sda7: Using r5 hash to sort names > > sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT > > driverbyte=DRIVER_OK,SUGGEST_OK > > end_request: I/O error, dev sdb, sector 19632 > > sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT > > driverbyte=DRIVER_OK,SUGGEST_OK > > end_request: I/O error, dev sdb, sector 40037363 > > Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 > > extents:1 across:1048568k > > lp0: using parport0 (interrupt-driven). > > > > These errors occur *only* with 2.6.24-rc3-mm1, they are 100% > > reproducible. > > 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine. > > > > Maybe something is broken in pata_via driver ? 
> > > Could be - > libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch > and > pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch > touch pata_via.c. > >>> None of the above... > >>> > >>> I did a bisection, it spotted git-scsi-misc.patch. > >>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works > >>> fine. > >>> > >>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do > >>> not > >>> requeue requests if REQ_FAILFAST is set" is the real culprit. The > >>> other > >>> commits are touching documentation or drivers I don't use. I'll try > >>> to revert only this one this evening. > I can confirm : reverting commit > 8655a546c83fc43f0a73416bbd126d02de7ad6c0 > does fix the problem. > > >> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an > >> error where > >> I shouldn't. Checking ... > >> > > Ok, found it. We are blocking even special commands (ie requests with > > PREEMPT not set) > > when FAILFAST is set. Which is clearly wrong. The attached patch fixes > > this. > Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O > errors. > >>> I think the problem is the way we treat BLOCKED and QUIESCED (the latter > >>> is the state that the domain validation uses and which we cannot kill > >>> fastfail on). It's definitely wrong to kill fastfail requests when the > >>> state is QUIESCE. > >>> > >>> This patch (which is applied on top of Hannes original) separates the > >>> BLOCK and QUIESCE states correctly ... does this fix the problem? > >> > >> No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems) > > > > OK, could you post dmesgs again, please. I actually tested this with an > > aic79xx card, and for me it does cause Domain Validation to succeed > > again. > > > > Are the patches indeed to fix that problem as well ? 
> > http://lkml.org/lkml/2007/11/23/5

That dmesg is from an unknown SCSI card exhibiting Domain Validation problems, so it's a reasonable probability, yes ... but you'll need the additional hack I just did to prevent further intermittent failures.

James
Re: 2.6.24-rc3-mm1: I/O error, system hangs
James Bottomley wrote: > On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote: >> Le 24.11.2007 07:42, James Bottomley a écrit : >>> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote: Le 23.11.2007 12:38, Hannes Reinecke a écrit : > Hannes Reinecke wrote: >> Laurent Riffard wrote: >>> Le 21.11.2007 23:41, Andrew Morton a écrit : On Wed, 21 Nov 2007 22:45:22 +0100 Laurent Riffard <[EMAIL PROTECTED]> wrote: > Le 21.11.2007 05:45, Andrew Morton a écrit : >> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/ > Hello, > > My system hangs shortly after I logged in Gnome desktop. SysRq-W shows > that a bunch of task are blocked in "D" state, they seem to wait for > some I/O completion. I can try to hand-copy some data if requested. > > I found these messages in dmesg: > > ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 > EXT3-fs: mounted filesystem with ordered data mode. > sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT > driverbyte=DRIVER_OK,SUGGEST_OK > end_request: I/O error, dev sda, sector 16460 > ReiserFS: sda7: found reiserfs format "3.6" with standard journal > ReiserFS: sda7: using ordered data mode > -- > ReiserFS: sda7: Using r5 hash to sort names > sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT > driverbyte=DRIVER_OK,SUGGEST_OK > end_request: I/O error, dev sdb, sector 19632 > sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT > driverbyte=DRIVER_OK,SUGGEST_OK > end_request: I/O error, dev sdb, sector 40037363 > Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 > extents:1 across:1048568k > lp0: using parport0 (interrupt-driven). > > These errors occur *only* with 2.6.24-rc3-mm1, they are 100% > reproducible. > 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine. > > Maybe something is broken in pata_via driver ? > Could be - libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch and pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch touch pata_via.c. 
>>> None of the above... >>> >>> I did a bisection, it spotted git-scsi-misc.patch. >>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works >>> fine. >>> >>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not >>> requeue requests if REQ_FAILFAST is set" is the real culprit. The other >>> commits are touching documentation or drivers I don't use. I'll try >>> to revert only this one this evening. I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 does fix the problem. >> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an >> error where >> I shouldn't. Checking ... >> > Ok, found it. We are blocking even special commands (ie requests with > PREEMPT not set) > when FAILFAST is set. Which is clearly wrong. The attached patch fixes > this. Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O errors. >>> I think the problem is the way we treat BLOCKED and QUIESCED (the latter >>> is the state that the domain validation uses and which we cannot kill >>> fastfail on). It's definitely wrong to kill fastfail requests when the >>> state is QUIESCE. >>> >>> This patch (which is applied on top of Hannes original) separates the >>> BLOCK and QUIESCE states correctly ... does this fix the problem? >> >> No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems) > > OK, could you post dmesgs again, please. I actually tested this with an > aic79xx card, and for me it does cause Domain Validation to succeed > again. > Are the patches indeed to fix that problem as well ? http://lkml.org/lkml/2007/11/23/5 > James Gabriel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.24-rc3-mm1: I/O error, system hangs
Probing intermittent failures in Domain Validation, even with the fixes applied, leads me to the conclusion that there are further problems with this commit:

commit fc5eb4facedbd6d7117905e775cee1975f894e79
Author: Hannes Reinecke <[EMAIL PROTECTED]>
Date: Tue Nov 6 09:23:40 2007 +0100

    [SCSI] Do not requeue requests if REQ_FAILFAST is set

The essence of the problem is that you're causing REQ_FAILFAST to terminate commands with an error on requeuing conditions, some of which are relatively common on most SCSI devices. While this may be the correct behaviour for multi-path, it's certainly wrong for the previously understood meaning of REQ_FAILFAST, which was "don't retry on error". That is why domain validation and other applications use it to control error handling; they don't expect to get failures for a simple requeue, and they are now spitting errors.

I honestly can't see that, even for the multi-path case, returning an error when we're over queue depth is the correct thing to do (it may not matter to something like a symmetrix, but an array that has a non-zero cost associated with a path change, like a CPQ HSV or the AVT controllers, will show fairly large slowdowns if you do this). Even if this is the desired behaviour (and I think that's a policy issue), DID_NO_CONNECT is almost certainly the wrong error to be sending back.

This patch fixes up domain validation to work correctly again; however, I really think it's just a band-aid. Do you want to rethink the above commit?
James

Index: BUILD-2.6/drivers/scsi/scsi_lib.c
===
--- BUILD-2.6.orig/drivers/scsi/scsi_lib.c 2007-11-24 11:25:20.0 -0600
+++ BUILD-2.6/drivers/scsi/scsi_lib.c 2007-11-24 11:26:22.0 -0600
@@ -1552,7 +1552,8 @@ static void scsi_request_fn(struct reque
 			break;
 		if (!scsi_dev_queue_ready(q, sdev)) {
-			if (req->cmd_flags & REQ_FAILFAST) {
+			if ((req->cmd_flags & REQ_FAILFAST) &&
+			    !(req->cmd_flags & REQ_PREEMPT)) {
 				scsi_kill_request(req, q);
 				continue;
 			}
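[Editorial sketch, not part of the original message.] The one-line change to scsi_request_fn() in the message above can be modelled in user space to see which requests get killed when the device queue is not ready. The flag values and the names demo_kill_before(), demo_kill_after() and demo_self_check() are hypothetical and exist only for this illustration; the real definitions live in the kernel headers, and this is a sketch of the condition only, not of the request function itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits only; the kernel's real values differ. */
#define DEMO_REQ_FAILFAST (1u << 0)
#define DEMO_REQ_PREEMPT  (1u << 1)

/* Before the patch: any FAILFAST request is killed when the
 * device queue is not ready. */
static bool demo_kill_before(unsigned int cmd_flags)
{
	return (cmd_flags & DEMO_REQ_FAILFAST) != 0;
}

/* After the patch: a FAILFAST request is killed only if it is
 * not also a PREEMPT (special) request. */
static bool demo_kill_after(unsigned int cmd_flags)
{
	return (cmd_flags & DEMO_REQ_FAILFAST) &&
	       !(cmd_flags & DEMO_REQ_PREEMPT);
}

/* Small self-check covering the case the patch changes. */
static void demo_self_check(void)
{
	unsigned int special = DEMO_REQ_FAILFAST | DEMO_REQ_PREEMPT;

	assert(demo_kill_before(special));  /* old: killed */
	assert(!demo_kill_after(special));  /* new: left on the queue */
}
```

The only input for which the two predicates differ is a request carrying both flags, which matches the message's description of domain validation commands being failed on a simple requeue condition.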
Re: 2.6.24-rc3-mm1: I/O error, system hangs
On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote: > Le 24.11.2007 07:42, James Bottomley a écrit : > > On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote: > >> Le 23.11.2007 12:38, Hannes Reinecke a écrit : > >>> Hannes Reinecke wrote: > Laurent Riffard wrote: > > Le 21.11.2007 23:41, Andrew Morton a écrit : > >> On Wed, 21 Nov 2007 22:45:22 +0100 > >> Laurent Riffard <[EMAIL PROTECTED]> wrote: > >> > >>> Le 21.11.2007 05:45, Andrew Morton a écrit : > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/ > >>> Hello, > >>> > >>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows > >>> that a bunch of task are blocked in "D" state, they seem to wait for > >>> some I/O completion. I can try to hand-copy some data if requested. > >>> > >>> I found these messages in dmesg: > >>> > >>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 > >>> EXT3-fs: mounted filesystem with ordered data mode. > >>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT > >>> driverbyte=DRIVER_OK,SUGGEST_OK > >>> end_request: I/O error, dev sda, sector 16460 > >>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal > >>> ReiserFS: sda7: using ordered data mode > >>> -- > >>> ReiserFS: sda7: Using r5 hash to sort names > >>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT > >>> driverbyte=DRIVER_OK,SUGGEST_OK > >>> end_request: I/O error, dev sdb, sector 19632 > >>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT > >>> driverbyte=DRIVER_OK,SUGGEST_OK > >>> end_request: I/O error, dev sdb, sector 40037363 > >>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 > >>> extents:1 across:1048568k > >>> lp0: using parport0 (interrupt-driven). > >>> > >>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% > >>> reproducible. > >>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine. > >>> > >>> Maybe something is broken in pata_via driver ? 
> >>> > >> Could be - > >> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch > >> and > >> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch > >> touch pata_via.c. > > None of the above... > > > > I did a bisection, it spotted git-scsi-misc.patch. > > I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works > > fine. > > > > I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not > > requeue requests if REQ_FAILFAST is set" is the real culprit. The other > > commits are touching documentation or drivers I don't use. I'll try > > to revert only this one this evening. > >> I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 > >> does fix the problem. > >> > Hmm. Weird. I'll have a look into it. Apparently I'll be returning an > error where > I shouldn't. Checking ... > > >>> Ok, found it. We are blocking even special commands (ie requests with > >>> PREEMPT not set) > >>> when FAILFAST is set. Which is clearly wrong. The attached patch fixes > >>> this. > >> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O > >> errors. > > > > I think the problem is the way we treat BLOCKED and QUIESCED (the latter > > is the state that the domain validation uses and which we cannot kill > > fastfail on). It's definitely wrong to kill fastfail requests when the > > state is QUIESCE. > > > > This patch (which is applied on top of Hannes original) separates the > > BLOCK and QUIESCE states correctly ... does this fix the problem? > > > No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems) OK, could you post dmesgs again, please. I actually tested this with an aic79xx card, and for me it does cause Domain Validation to succeed again. 
James
Re: 2.6.24-rc3-mm1: I/O error, system hangs
Le 24.11.2007 07:42, James Bottomley a écrit : > On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote: >> Le 23.11.2007 12:38, Hannes Reinecke a écrit : >>> Hannes Reinecke wrote: Laurent Riffard wrote: > Le 21.11.2007 23:41, Andrew Morton a écrit : >> On Wed, 21 Nov 2007 22:45:22 +0100 >> Laurent Riffard <[EMAIL PROTECTED]> wrote: >> >>> Le 21.11.2007 05:45, Andrew Morton a écrit : ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/ >>> Hello, >>> >>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows >>> that a bunch of task are blocked in "D" state, they seem to wait for >>> some I/O completion. I can try to hand-copy some data if requested. >>> >>> I found these messages in dmesg: >>> >>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 >>> EXT3-fs: mounted filesystem with ordered data mode. >>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT >>> driverbyte=DRIVER_OK,SUGGEST_OK >>> end_request: I/O error, dev sda, sector 16460 >>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal >>> ReiserFS: sda7: using ordered data mode >>> -- >>> ReiserFS: sda7: Using r5 hash to sort names >>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT >>> driverbyte=DRIVER_OK,SUGGEST_OK >>> end_request: I/O error, dev sdb, sector 19632 >>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT >>> driverbyte=DRIVER_OK,SUGGEST_OK >>> end_request: I/O error, dev sdb, sector 40037363 >>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 >>> extents:1 across:1048568k >>> lp0: using parport0 (interrupt-driven). >>> >>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% >>> reproducible. >>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine. >>> >>> Maybe something is broken in pata_via driver ? >>> >> Could be - >> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch >> and >> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch >> touch pata_via.c. 
> None of the above... > > I did a bisection, it spotted git-scsi-misc.patch. > I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine. > > I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not > requeue requests if REQ_FAILFAST is set" is the real culprit. The other > commits are touching documentation or drivers I don't use. I'll try > to revert only this one this evening. >> I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 >> does fix the problem. >> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an error where I shouldn't. Checking ... >>> Ok, found it. We are blocking even special commands (ie requests with >>> PREEMPT not set) >>> when FAILFAST is set. Which is clearly wrong. The attached patch fixes this. >> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O >> errors. > > I think the problem is the way we treat BLOCKED and QUIESCED (the latter > is the state that the domain validation uses and which we cannot kill > fastfail on). It's definitely wrong to kill fastfail requests when the > state is QUIESCE. > > This patch (which is applied on top of Hannes original) separates the > BLOCK and QUIESCE states correctly ... does this fix the problem? No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems) > James > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c > index 13e7e09..a7cf23a 100644 > --- a/drivers/scsi/scsi_lib.c > +++ b/drivers/scsi/scsi_lib.c > @@ -1279,18 +1279,21 @@ int scsi_prep_state_check(struct scsi_device *sdev, > struct request *req) > "rejecting I/O to dead device\n"); > ret = BLKPREP_KILL; > break; > - case SDEV_QUIESCE: > case SDEV_BLOCK: > /* > - * If the devices is blocked we defer normal commands. 
> - */ > - if (!(req->cmd_flags & REQ_PREEMPT)) > - ret = BLKPREP_DEFER; > - /* >* Return failfast requests immediately >*/ > if (req->cmd_flags & REQ_FAILFAST) > ret = BLKPREP_KILL; > + > + /* fall through */ > + > + case SDEV_QUIESCE: > + /* > + * If the devices is blocked we defer normal commands. > + */ > + if (!(req->cmd_flags & REQ_PREEMPT)) > + ret = BLKPREP_DEFER; > break; > default: > /* > - To
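[Editorial sketch, not part of the original message.] The quoted patch to scsi_prep_state_check() reorders the SDEV_BLOCK and SDEV_QUIESCE cases and adds a fall-through. A small user-space model makes the resulting decision table easier to read. All identifiers prefixed with demo_ or DEMO_ are hypothetical stand-ins for the kernel's device states, BLKPREP_* return codes and request flags; only the control flow is taken from the quoted diff.

```c
#include <assert.h>

/* Stand-ins for the kernel's device states and BLKPREP_* codes. */
enum demo_state { DEMO_SDEV_RUNNING, DEMO_SDEV_BLOCK, DEMO_SDEV_QUIESCE };
enum demo_prep  { DEMO_PREP_OK, DEMO_PREP_KILL, DEMO_PREP_DEFER };

/* Illustrative flag bits only; the kernel's real values differ. */
#define DEMO_REQ_FAILFAST (1u << 0)
#define DEMO_REQ_PREEMPT  (1u << 1)

/* Mirrors the patched switch: BLOCK first, then falls through
 * into the QUIESCE handling. */
static enum demo_prep demo_prep_state_check(enum demo_state state,
					    unsigned int cmd_flags)
{
	enum demo_prep ret = DEMO_PREP_OK;

	switch (state) {
	case DEMO_SDEV_BLOCK:
		/* Return failfast requests immediately. */
		if (cmd_flags & DEMO_REQ_FAILFAST)
			ret = DEMO_PREP_KILL;
		/* fall through */
	case DEMO_SDEV_QUIESCE:
		/* Defer normal (non-PREEMPT) commands. */
		if (!(cmd_flags & DEMO_REQ_PREEMPT))
			ret = DEMO_PREP_DEFER;
		break;
	default:
		break;
	}
	return ret;
}
```

In this model a FAILFAST request is never killed in the QUIESCE state, which is the stated goal of the patch. Note also that, because of the fall-through, a FAILFAST request without PREEMPT on a blocked device is first marked for killing and then overwritten with a deferral; whether that was intended is not stated in the thread.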
Re: 2.6.24-rc3-mm1: I/O error, system hangs
James Bottomley wrote: On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote: James Bottomley wrote: On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote: Le 24.11.2007 07:42, James Bottomley a écrit : On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote: Le 23.11.2007 12:38, Hannes Reinecke a écrit : Hannes Reinecke wrote: Laurent Riffard wrote: Le 21.11.2007 23:41, Andrew Morton a écrit : On Wed, 21 Nov 2007 22:45:22 +0100 Laurent Riffard [EMAIL PROTECTED] wrote: Le 21.11.2007 05:45, Andrew Morton a écrit : ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/ Hello, My system hangs shortly after I logged in Gnome desktop. SysRq-W shows that a bunch of task are blocked in D state, they seem to wait for some I/O completion. I can try to hand-copy some data if requested. I found these messages in dmesg: ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 EXT3-fs: mounted filesystem with ordered data mode. sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sda, sector 16460 ReiserFS: sda7: found reiserfs format 3.6 with standard journal ReiserFS: sda7: using ordered data mode -- ReiserFS: sda7: Using r5 hash to sort names sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sdb, sector 19632 sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sdb, sector 40037363 Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 extents:1 across:1048568k lp0: using parport0 (interrupt-driven). These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible. 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine. Maybe something is broken in pata_via driver ? Could be - libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch and pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch touch pata_via.c. None of the above... 
I did a bisection, it spotted git-scsi-misc.patch. I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine. I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 [SCSI] Do not requeue requests if REQ_FAILFAST is set is the real culprit. The other commits are touching documentation or drivers I don't use. I'll try to revert only this one this evening. I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 does fix the problem. Hmm. Weird. I'll have a look into it. Apparently I'll be returning an error where I shouldn't. Checking ... Ok, found it. We are blocking even special commands (ie requests with PREEMPT not set) when FAILFAST is set. Which is clearly wrong. The attached patch fixes this. Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O errors. I think the problem is the way we treat BLOCKED and QUIESCED (the latter is the state that the domain validation uses and which we cannot kill fastfail on). It's definitely wrong to kill fastfail requests when the state is QUIESCE. This patch (which is applied on top of Hannes original) separates the BLOCK and QUIESCE states correctly ... does this fix the problem? No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems) OK, could you post dmesgs again, please. I actually tested this with an aic79xx card, and for me it does cause Domain Validation to succeed again. Are the patches indeed to fix that problem as well ? http://lkml.org/lkml/2007/11/23/5 That dmesg is from an unknown SCSI card exhibiting Domain Validation problems, so it's a reasonable probability, yes ... but you'll need the additional hack I just did to prevent further intermittent failures. My controller is: 03:0e.0 SCSI storage controller [0100]: Adaptec AIC-7892P U160/m [9005:008f] (rev 02) I'll try the patches in a bit. 
> James

Gabriel
Re: 2.6.24-rc3-mm1: I/O error, system hangs
Gabriel C wrote: James Bottomley wrote: On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote: James Bottomley wrote: On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote: Le 24.11.2007 07:42, James Bottomley a écrit : On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote: Le 23.11.2007 12:38, Hannes Reinecke a écrit : Hannes Reinecke wrote: Laurent Riffard wrote: Le 21.11.2007 23:41, Andrew Morton a écrit : On Wed, 21 Nov 2007 22:45:22 +0100 Laurent Riffard [EMAIL PROTECTED] wrote: Le 21.11.2007 05:45, Andrew Morton a écrit : ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/ Hello, My system hangs shortly after I logged in Gnome desktop. SysRq-W shows that a bunch of task are blocked in D state, they seem to wait for some I/O completion. I can try to hand-copy some data if requested. I found these messages in dmesg: ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 EXT3-fs: mounted filesystem with ordered data mode. sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sda, sector 16460 ReiserFS: sda7: found reiserfs format 3.6 with standard journal ReiserFS: sda7: using ordered data mode -- ReiserFS: sda7: Using r5 hash to sort names sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sdb, sector 19632 sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sdb, sector 40037363 Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 extents:1 across:1048568k lp0: using parport0 (interrupt-driven). These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible. 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine. Maybe something is broken in pata_via driver ? Could be - libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch and pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch touch pata_via.c. 
None of the above... I did a bisection, it spotted git-scsi-misc.patch. I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine. I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 [SCSI] Do not requeue requests if REQ_FAILFAST is set is the real culprit. The other commits are touching documentation or drivers I don't use. I'll try to revert only this one this evening. I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 does fix the problem. Hmm. Weird. I'll have a look into it. Apparently I'll be returning an error where I shouldn't. Checking ... Ok, found it. We are blocking even special commands (ie requests with PREEMPT not set) when FAILFAST is set. Which is clearly wrong. The attached patch fixes this. Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O errors. I think the problem is the way we treat BLOCKED and QUIESCED (the latter is the state that the domain validation uses and which we cannot kill fastfail on). It's definitely wrong to kill fastfail requests when the state is QUIESCE. This patch (which is applied on top of Hannes original) separates the BLOCK and QUIESCE states correctly ... does this fix the problem? No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems) OK, could you post dmesgs again, please. I actually tested this with an aic79xx card, and for me it does cause Domain Validation to succeed again. Are the patches indeed to fix that problem as well ? http://lkml.org/lkml/2007/11/23/5 That dmesg is from an unknown SCSI card exhibiting Domain Validation problems, so it's a reasonable probability, yes ... but you'll need the additional hack I just did to prevent further intermittent failures. My controller is: 03:0e.0 SCSI storage controller [0100]: Adaptec AIC-7892P U160/m [9005:008f] (rev 02) I'll try the patches in a bit. With your patches my problem(s) are solved. Domain Validation works again. ... 
[ 32.179521] scsi 0:0:0:0: Direct-Access SEAGATE ST318406LW 0109 PQ: 0 ANSI: 3
[ 32.179540] scsi0:A:0:0: Tagged Queuing enabled. Depth 32
[ 32.179554] target0:0:0: Beginning Domain Validation
[ 32.188553] target0:0:0: wide asynchronous
[ 32.195302] target0:0:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 63)
[ 32.206510] target0:0:0: Ending Domain Validation
[ 32.211699] scsi 0:0:1:0: Direct-Access FUJITSU MAH3182MP 0114 PQ: 0 ANSI: 4
[ 32.211707] scsi0:A:1:0: Tagged Queuing enabled. Depth 32
[ 32.211717] target0:0:1: Beginning Domain Validation
[ 32.213980] target0:0:1: wide asynchronous
[ 32.215682] target0:0:1: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 127)
[ 32.220205] target0:0:1: Ending Domain Validation
...

Thx James :)

Gabriel
Re: 2.6.24-rc3-mm1: I/O error, system hangs
On 24.11.2007 14:26, James Bottomley wrote:
> On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
>> On 24.11.2007 07:42, James Bottomley wrote:
>>> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
>>>> On 23.11.2007 12:38, Hannes Reinecke wrote:
[snip]
>>>> I can confirm: reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
>>>> does fix the problem.
>>>>
>>>>>> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an
>>>>>> error where I shouldn't. Checking ...
>>>>>>
>>>>> Ok, found it. We are blocking even special commands (ie requests with
>>>>> PREEMPT not set) when FAILFAST is set. Which is clearly wrong. The
>>>>> attached patch fixes this.
>>>>
>>>> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with
>>>> I/O errors.
>>>
>>> I think the problem is the way we treat BLOCKED and QUIESCED (the latter
>>> is the state that the domain validation uses and which we cannot kill
>>> fastfail on). It's definitely wrong to kill fastfail requests when the
>>> state is QUIESCE.
>>>
>>> This patch (which is applied on top of Hannes original) separates the
>>> BLOCK and QUIESCE states correctly ... does this fix the problem?
>>
>> No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
>
> OK, could you post dmesgs again, please. I actually tested this with an
> aic79xx card, and for me it does cause Domain Validation to succeed again.

James,

Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates the
BLOCK and QUIESCE states correctly" (http://lkml.org/lkml/2007/11/24/8).

How to reproduce:
- boot
- switch to a text console
- capture dmesg in a file, sync, etc. There are 3 I/O errors, but the
  system does work.
- switch to the X console and log in to the Gnome desktop: the system
  partially hangs.
- switch back to a text console: dmesg(1) still works, and it shows some
  additional I/O errors. At this point, any disk access makes the system
  hang completely.

Additional data:
- the I/O errors always happen on the same blocks.
--
laurent

[0.00] Linux version 2.6.24-rc3-mm1 ([EMAIL PROTECTED]) (gcc version 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)) #122 PREEMPT Fri Nov 23 18:47:58 CET 2007
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: - 0009fc00 (usable)
[0.00] BIOS-e820: 0009fc00 - 000a (reserved)
[0.00] BIOS-e820: 000f - 0010 (reserved)
[0.00] BIOS-e820: 0010 - 1ffec000 (usable)
[0.00] BIOS-e820: 1ffec000 - 1ffef000 (ACPI data)
[0.00] BIOS-e820: 1ffef000 - 1000 (reserved)
[0.00] BIOS-e820: 1000 - 2000 (ACPI NVS)
[0.00] BIOS-e820: - 0001 (reserved)
[0.00] 511MB LOWMEM available.
[0.00] Entering add_active_range(0, 0, 131052) 0 entries of 256 used
[0.00] sizeof(struct page) = 32
[0.00] Zone PFN ranges:
[0.00] DMA 0 - 4096
[0.00] Normal 4096 - 131052
[0.00] Movable zone start PFN for each node
[0.00] early_node_map[1] active PFN ranges
[0.00] 0:0 - 131052
[0.00] On node 0 totalpages: 131052
[0.00] Node 0 memmap at 0xC100 size 4194304 first pfn 0xC100
[0.00] DMA zone: 32 pages used for memmap
[0.00] DMA zone: 0 pages reserved
[0.00] DMA zone: 4064 pages, LIFO batch:0
[0.00] Normal zone: 991 pages used for memmap
[0.00] Normal zone: 125965 pages, LIFO batch:31
[0.00] Movable zone: 0 pages used for memmap
[0.00] DMI 2.3 present.
[0.00] ACPI: RSDP 000F6A80, 0014 (r0 ASUS )
[0.00] ACPI: RSDT 1FFEC000, 002C (r1 ASUS A7V133-C 30303031 MSFT 31313031)
[0.00] ACPI: FACP 1FFEC080, 0074 (r1 ASUS A7V133-C 30303031 MSFT 31313031)
[0.00] ACPI: DSDT 1FFEC100, 2CE1 (r1 ASUS A7V133-C 1000 MSFT 10B)
[0.00] ACPI: FACS 1000, 0040
[0.00] ACPI: BOOT 1FFEC040, 0028 (r1 ASUS A7V133-C 30303031 MSFT 31313031)
[0.00] ACPI: PM-Timer IO Port: 0xe408
[0.00] Allocating PCI resources starting at 3000 (gap: 2000:dfff)
[0.00] swsusp: Registered nosave memory region: 0009f000 - 000a
[0.00] swsusp: Registered nosave memory region: 000a - 000f
[0.00] swsusp: Registered nosave memory region: 000f - 0010
[0.00] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 130029
[0.00] Kernel command line: root=/dev/mapper/vglinux1-lv_ubuntu2 ro locale=fr_FR video=radeonfb:[EMAIL PROTECTED] resume=/dev/mapper/vglinux1-lvswap
[0.00] Local APIC disabled by BIOS -- you can enable it with lapic
[0.00] mapped APIC to b000 (01406000)
[0.00] Enabling fast FPU save and restore... done.
[0.00] Enabling unmasked SIMD FPU
Re: 2.6.24-rc3-mm1: I/O error, system hangs
On Sat, 2007-11-24 at 23:59 +0100, Laurent Riffard wrote: Le 24.11.2007 14:26, James Bottomley a écrit : OK, could you post dmesgs again, please. I actually tested this with an aic79xx card, and for me it does cause Domain Validation to succeed again. James, Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch separates the BLOCK and QUIESCE states correctly (http://lkml.org/lkml/2007/11/24/8). How to reproduce : - boot - switch to a text console - capture dmesg in a file, sync, etc. There are 3 I/O errors, but the system does work. - switch to X console, log in the Gnome Desktop, the system partially hangs. - switch back to a text console: dmesg(1) still works, it shows some additonal I/O errors. At this point, any disk access makes the system completely hung. Additionnal data: - the I/O errors always happen on the same blocks. plain text document attachment (dmesg-2.6.24-rc3-mm1-patched) [...] [ 25.521256] scsi0 : pata_via [ 25.521711] scsi1 : pata_via [ 25.524089] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xb800 irq 14 [ 25.524176] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xb808 irq 15 [ 25.683141] ata1.00: ATA-5: ST340016A, 3.75, max UDMA/100 [ 25.683208] ata1.00: 78165360 sectors, multi 16: LBA [ 25.683475] ata1.01: ATA-7: Maxtor 6Y080L0, YAR41BW0, max UDMA/133 [ 25.684116] ata1.01: 160086528 sectors, multi 16: LBA [ 25.691127] ata1.00: configured for UDMA/100 [ 25.699142] ata1.01: configured for UDMA/100 [ 26.170860] ata2.00: ATAPI: HL-DT-ST DVDRAM GSA-4165B, DL05, max UDMA/33 [ 26.171562] ata2.01: ATAPI: CD-950E/AKU, A4Q, max MWDMA2, CDB intr [ 26.330839] ata2.00: configured for UDMA/33 [ 26.490828] ata2.01: configured for MWDMA2 [ 26.503014] scsi 0:0:0:0: Direct-Access ATA ST340016A 3.75 PQ: 0 ANSI: 5 [ 26.504670] scsi 0:0:1:0: Direct-Access ATA Maxtor 6Y080L0 YAR4 PQ: 0 ANSI: 5 [ 26.509842] scsi 1:0:0:0: CD-ROMHL-DT-ST DVDRAM GSA-4165B DL05 PQ: 0 ANSI: 5 [ 26.511673] scsi 1:0:1:0: CD-ROME-IDECD-950E/AKU A4Q PQ: 0 ANSI: 5 [...] 
> [ 60.216113] sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> [ 60.216124] end_request: I/O error, dev sda, sector 16460

I think this one's quite easy: PATA devices in libata are queue depth 1
(since they don't do NCQ). Thus, they're peculiarly sensitive to the
bug where we fail over queue depth requests.

On the other hand, I don't see how a filesystem request is getting
REQ_FAILFAST ... unless there's a bio or readahead issue involved.

Anyway, could you try this patch:

http://marc.info/?l=linux-scsi&m=119592627425498

Which should fix the queue depth issue, and see if the errors go away?

Thanks,

James
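To make the queue depth argument concrete, here is a minimal user-space sketch. It is not kernel code: the toy_* names and the SUBMIT_* values are made up for illustration, and it assumes the bug is that a failfast request which does not fit under the queue depth gets failed instead of requeued.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome codes for this sketch; not kernel identifiers. */
enum submit_result { SUBMIT_QUEUED, SUBMIT_REQUEUED, SUBMIT_FAILED };

/* Toy device: only tracks how many commands are currently in flight. */
struct toy_device {
	int queue_depth;	/* 1 for a PATA disk without NCQ */
	int in_flight;
};

/*
 * Assumed behaviour: a request that does not fit under the queue depth
 * is normally requeued, but a failfast request is failed instead.
 */
static enum submit_result toy_submit(struct toy_device *dev, bool failfast)
{
	if (dev->in_flight < dev->queue_depth) {
		dev->in_flight++;
		return SUBMIT_QUEUED;
	}
	return failfast ? SUBMIT_FAILED : SUBMIT_REQUEUED;
}

/* Submit n failfast requests back to back and count how many fail. */
static int toy_count_failures(int queue_depth, int n)
{
	struct toy_device dev = { queue_depth, 0 };
	int failed = 0;

	for (int i = 0; i < n; i++)
		if (toy_submit(&dev, true) == SUBMIT_FAILED)
			failed++;
	return failed;
}

/* Small self-check of the model's two extremes. */
static void toy_self_check(void)
{
	assert(toy_count_failures(1, 4) == 3);	/* depth 1: all but the first fail */
	assert(toy_count_failures(32, 4) == 0);	/* depth 32: a short burst fits */
}
```

Under these assumptions a depth-1 device fails every failfast request that arrives while another command is outstanding, whereas a depth-32 device (as in the aic7xxx log earlier in the thread) absorbs the same burst.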
Re: 2.6.24-rc3-mm1: I/O error, system hangs
On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote: > Le 23.11.2007 12:38, Hannes Reinecke a écrit : > > Hannes Reinecke wrote: > >> Laurent Riffard wrote: > >>> Le 21.11.2007 23:41, Andrew Morton a écrit : > On Wed, 21 Nov 2007 22:45:22 +0100 > Laurent Riffard <[EMAIL PROTECTED]> wrote: > > > Le 21.11.2007 05:45, Andrew Morton a écrit : > >> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/ > > Hello, > > > > My system hangs shortly after I logged in Gnome desktop. SysRq-W shows > > that a bunch of task are blocked in "D" state, they seem to wait for > > some I/O completion. I can try to hand-copy some data if requested. > > > > I found these messages in dmesg: > > > > ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 > > EXT3-fs: mounted filesystem with ordered data mode. > > sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT > > driverbyte=DRIVER_OK,SUGGEST_OK > > end_request: I/O error, dev sda, sector 16460 > > ReiserFS: sda7: found reiserfs format "3.6" with standard journal > > ReiserFS: sda7: using ordered data mode > > -- > > ReiserFS: sda7: Using r5 hash to sort names > > sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT > > driverbyte=DRIVER_OK,SUGGEST_OK > > end_request: I/O error, dev sdb, sector 19632 > > sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT > > driverbyte=DRIVER_OK,SUGGEST_OK > > end_request: I/O error, dev sdb, sector 40037363 > > Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 > > extents:1 across:1048568k > > lp0: using parport0 (interrupt-driven). > > > > These errors occur *only* with 2.6.24-rc3-mm1, they are 100% > > reproducible. > > 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine. > > > > Maybe something is broken in pata_via driver ? > > > Could be - > libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch > and > pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch > touch pata_via.c. > >>> None of the above... 
> >>>
> >>> I did a bisection, it spotted git-scsi-misc.patch.
> >>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine.
> >>>
> >>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not
> >>> requeue requests if REQ_FAILFAST is set" is the real culprit. The other
> >>> commits are touching documentation or drivers I don't use. I'll try
> >>> to revert only this one this evening.
>
> I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
> does fix the problem.
>
> >> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an
> >> error where I shouldn't. Checking ...
> >>
> > Ok, found it. We are blocking even special commands (ie requests with
> > PREEMPT not set) when FAILFAST is set. Which is clearly wrong. The
> > attached patch fixes this.
>
> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O
> errors.

I think the problem is the way we treat BLOCKED and QUIESCED (the latter
is the state that the domain validation uses and which we cannot kill
fastfail on). It's definitely wrong to kill fastfail requests when the
state is QUIESCE.

This patch (which is applied on top of Hannes original) separates the
BLOCK and QUIESCE states correctly ... does this fix the problem?

James

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 13e7e09..a7cf23a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1279,18 +1279,21 @@ int scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 				    "rejecting I/O to dead device\n");
 			ret = BLKPREP_KILL;
 			break;
-		case SDEV_QUIESCE:
 		case SDEV_BLOCK:
 			/*
-			 * If the devices is blocked we defer normal commands.
-			 */
-			if (!(req->cmd_flags & REQ_PREEMPT))
-				ret = BLKPREP_DEFER;
-			/*
 			 * Return failfast requests immediately
 			 */
 			if (req->cmd_flags & REQ_FAILFAST)
 				ret = BLKPREP_KILL;
+
+			/* fall through */
+
+		case SDEV_QUIESCE:
+			/*
+			 * If the devices is blocked we defer normal commands.
+			 */
+			if (!(req->cmd_flags & REQ_PREEMPT))
+				ret = BLKPREP_DEFER;
 			break;
 		default:
 			/*
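For readers who want to trace the fall-through, the following user-space sketch mirrors only the control flow of the hunk above. The TOY_* names and flag values are invented for the example, and it assumes the result starts out as "OK" before the switch; the real function has more states and does more work.

```c
#include <assert.h>

/* Hypothetical stand-ins for the device states and prep results. */
enum toy_state { TOY_RUNNING, TOY_QUIESCE, TOY_BLOCK };
enum toy_prep { TOY_OK, TOY_DEFER, TOY_KILL };

/* Hypothetical flag bits standing in for REQ_PREEMPT / REQ_FAILFAST. */
#define TOY_PREEMPT	0x1u
#define TOY_FAILFAST	0x2u

/* Same branch structure as the patched switch: BLOCK falls into QUIESCE. */
static enum toy_prep toy_state_check(enum toy_state state, unsigned int flags)
{
	enum toy_prep ret = TOY_OK;	/* assumed initial value */

	switch (state) {
	case TOY_BLOCK:
		if (flags & TOY_FAILFAST)
			ret = TOY_KILL;
		/* fall through */
	case TOY_QUIESCE:
		if (!(flags & TOY_PREEMPT))
			ret = TOY_DEFER;
		break;
	default:
		break;
	}
	return ret;
}

/* Self-check of the case the patch is aimed at. */
static void toy_state_self_check(void)
{
	/* Quiesced device: a failfast request is deferred, never killed. */
	assert(toy_state_check(TOY_QUIESCE, TOY_FAILFAST) == TOY_DEFER);
}
```

In this model a quiesced device never returns the "kill" result, which matches the stated intent. Read literally, the fall-through also means that on a blocked device the "kill" result survives only when the preempt flag is set too; whether that is intended cannot be told from the hunk alone.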
Re: 2.6.24-rc3-mm1: I/O error, system hangs
Le 23.11.2007 12:38, Hannes Reinecke a écrit : > Hannes Reinecke wrote: >> Laurent Riffard wrote: >>> Le 21.11.2007 23:41, Andrew Morton a écrit : On Wed, 21 Nov 2007 22:45:22 +0100 Laurent Riffard <[EMAIL PROTECTED]> wrote: > Le 21.11.2007 05:45, Andrew Morton a écrit : >> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/ > Hello, > > My system hangs shortly after I logged in Gnome desktop. SysRq-W shows > that a bunch of task are blocked in "D" state, they seem to wait for > some I/O completion. I can try to hand-copy some data if requested. > > I found these messages in dmesg: > > ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 > EXT3-fs: mounted filesystem with ordered data mode. > sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT > driverbyte=DRIVER_OK,SUGGEST_OK > end_request: I/O error, dev sda, sector 16460 > ReiserFS: sda7: found reiserfs format "3.6" with standard journal > ReiserFS: sda7: using ordered data mode > -- > ReiserFS: sda7: Using r5 hash to sort names > sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT > driverbyte=DRIVER_OK,SUGGEST_OK > end_request: I/O error, dev sdb, sector 19632 > sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT > driverbyte=DRIVER_OK,SUGGEST_OK > end_request: I/O error, dev sdb, sector 40037363 > Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 > extents:1 across:1048568k > lp0: using parport0 (interrupt-driven). > > These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible. > 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine. > > Maybe something is broken in pata_via driver ? > Could be - libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch and pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch touch pata_via.c. >>> None of the above... >>> >>> I did a bisection, it spotted git-scsi-misc.patch. >>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine. 
>>>
>>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not
>>> requeue requests if REQ_FAILFAST is set" is the real culprit. The other
>>> commits are touching documentation or drivers I don't use. I'll try
>>> to revert only this one this evening.

I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
does fix the problem.

>> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an error
>> where I shouldn't. Checking ...
>>
> Ok, found it. We are blocking even special commands (ie requests with PREEMPT
> not set) when FAILFAST is set. Which is clearly wrong. The attached patch
> fixes this.

Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O
errors.

--
laurent
Re: 2.6.24-rc3-mm1: I/O error, system hangs
Hannes Reinecke wrote: > Laurent Riffard wrote: >> Le 21.11.2007 23:41, Andrew Morton a écrit : >>> On Wed, 21 Nov 2007 22:45:22 +0100 >>> Laurent Riffard <[EMAIL PROTECTED]> wrote: >>> Le 21.11.2007 05:45, Andrew Morton a écrit : > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/ Hello, My system hangs shortly after I logged in Gnome desktop. SysRq-W shows that a bunch of task are blocked in "D" state, they seem to wait for some I/O completion. I can try to hand-copy some data if requested. I found these messages in dmesg: ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 EXT3-fs: mounted filesystem with ordered data mode. sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sda, sector 16460 ReiserFS: sda7: found reiserfs format "3.6" with standard journal ReiserFS: sda7: using ordered data mode -- ReiserFS: sda7: Using r5 hash to sort names sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sdb, sector 19632 sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sdb, sector 40037363 Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 extents:1 across:1048568k lp0: using parport0 (interrupt-driven). These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible. 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine. Maybe something is broken in pata_via driver ? >>> Could be - >>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch >>> and >>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch >>> touch pata_via.c. >> None of the above... >> >> I did a bisection, it spotted git-scsi-misc.patch. >> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine. 
>>
>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not
>> requeue requests if REQ_FAILFAST is set" is the real culprit. The other
>> commits are touching documentation or drivers I don't use. I'll try
>> to revert only this one this evening.
>>
> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an error
> where I shouldn't. Checking ...
>
Ok, found it. We are blocking even special commands (ie requests with
PREEMPT not set) when FAILFAST is set. Which is clearly wrong. The
attached patch fixes this.

James, please apply.

Cheers,

Hannes
--
Dr. Hannes Reinecke                   zSeries & Storage
[EMAIL PROTECTED]                     +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

Fix SPI Domain validation

This fixes a thinko of the FAILFAST handling:
when we get a request with FAILFAST set, we still
have to evaluate the PREEMPT flag to decide if
this request should be passed through.

Signed-off-by: Hannes Reinecke <[EMAIL PROTECTED]>

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 13e7e09..9ec1566 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1284,13 +1284,15 @@ int scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 			/*
 			 * If the devices is blocked we defer normal commands.
 			 */
-			if (!(req->cmd_flags & REQ_PREEMPT))
-				ret = BLKPREP_DEFER;
-			/*
-			 * Return failfast requests immediately
-			 */
-			if (req->cmd_flags & REQ_FAILFAST)
-				ret = BLKPREP_KILL;
+			if (!(req->cmd_flags & REQ_PREEMPT)) {
+				/*
+				 * Return failfast requests immediately
+				 */
+				if (req->cmd_flags & REQ_FAILFAST)
+					ret = BLKPREP_KILL;
+				else
+					ret = BLKPREP_DEFER;
+			}
 			break;
 		default:
 			/*
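The difference between the removed and the added lines can be checked with a small user-space model. This is only a sketch of the two branch structures for a blocked or quiesced device: the DEMO_* names and flag values are invented here, and the result is assumed to start out as "OK".

```c
#include <assert.h>

/* Hypothetical stand-ins for the prep results and request flags. */
enum demo_prep { DEMO_OK, DEMO_DEFER, DEMO_KILL };

#define DEMO_PREEMPT	0x1u
#define DEMO_FAILFAST	0x2u

/* Branch structure of the removed lines: failfast always wins. */
static enum demo_prep demo_check_before(unsigned int flags)
{
	enum demo_prep ret = DEMO_OK;	/* assumed initial value */

	if (!(flags & DEMO_PREEMPT))
		ret = DEMO_DEFER;
	if (flags & DEMO_FAILFAST)
		ret = DEMO_KILL;
	return ret;
}

/* Branch structure of the added lines: preempt requests pass through. */
static enum demo_prep demo_check_after(unsigned int flags)
{
	enum demo_prep ret = DEMO_OK;	/* assumed initial value */

	if (!(flags & DEMO_PREEMPT)) {
		if (flags & DEMO_FAILFAST)
			ret = DEMO_KILL;
		else
			ret = DEMO_DEFER;
	}
	return ret;
}

/* Self-check of the one input where the two versions disagree. */
static void demo_self_check(void)
{
	unsigned int both = DEMO_PREEMPT | DEMO_FAILFAST;

	assert(demo_check_before(both) == DEMO_KILL);
	assert(demo_check_after(both) == DEMO_OK);
}
```

In this model the only input that changes result is a request carrying both flags: it was killed before and is passed through afterwards. A failfast request without the preempt flag is still killed in both versions, which is consistent with the I/O errors persisting after this patch alone.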
Re: 2.6.24-rc3-mm1: I/O error, system hangs
Le 23.11.2007 12:38, Hannes Reinecke a écrit : Hannes Reinecke wrote: Laurent Riffard wrote: Le 21.11.2007 23:41, Andrew Morton a écrit : On Wed, 21 Nov 2007 22:45:22 +0100 Laurent Riffard [EMAIL PROTECTED] wrote: Le 21.11.2007 05:45, Andrew Morton a écrit : ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/ Hello, My system hangs shortly after I logged in Gnome desktop. SysRq-W shows that a bunch of task are blocked in D state, they seem to wait for some I/O completion. I can try to hand-copy some data if requested. I found these messages in dmesg: ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 EXT3-fs: mounted filesystem with ordered data mode. sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sda, sector 16460 ReiserFS: sda7: found reiserfs format 3.6 with standard journal ReiserFS: sda7: using ordered data mode -- ReiserFS: sda7: Using r5 hash to sort names sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sdb, sector 19632 sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sdb, sector 40037363 Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 extents:1 across:1048568k lp0: using parport0 (interrupt-driven). These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible. 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine. Maybe something is broken in pata_via driver ? Could be - libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch and pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch touch pata_via.c. None of the above... I did a bisection, it spotted git-scsi-misc.patch. I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine. I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 [SCSI] Do not requeue requests if REQ_FAILFAST is set is the real culprit. 
The other commits are touching documentation or drivers I don't use. I'll try to revert only this one this evening. I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 does fix the problem. Hmm. Weird. I'll have a look into it. Apparently I'll be returning an error where I shouldn't. Checking ... Ok, found it. We are blocking even special commands (ie requests with PREEMPT not set) when FAILFAST is set. Which is clearly wrong. The attached patch fixes this. Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O errors. -- laurent - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.24-rc3-mm1: I/O error, system hangs
Hannes Reinecke wrote:
> Laurent Riffard wrote:
>> On 21.11.2007 23:41, Andrew Morton wrote:
>>> On Wed, 21 Nov 2007 22:45:22 +0100
>>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
>>>
>>>> On 21.11.2007 05:45, Andrew Morton wrote:
>>>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>>>> Hello,
>>>>
>>>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
>>>> that a bunch of task are blocked in "D" state, they seem to wait for
>>>> some I/O completion. I can try to hand-copy some data if requested.
>>>>
>>>> I found these messages in dmesg:
>>>>
>>>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1
>>>> EXT3-fs: mounted filesystem with ordered data mode.
>>>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>>>> end_request: I/O error, dev sda, sector 16460
>>>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
>>>> ReiserFS: sda7: using ordered data mode
>>>> --
>>>> ReiserFS: sda7: Using r5 hash to sort names
>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>>>> end_request: I/O error, dev sdb, sector 19632
>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>>>> end_request: I/O error, dev sdb, sector 40037363
>>>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 extents:1 across:1048568k
>>>> lp0: using parport0 (interrupt-driven).
>>>>
>>>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible.
>>>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>>>>
>>>> Maybe something is broken in pata_via driver ?
>>>>
>>> Could be -
>>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
>>> and
>>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
>>> touch pata_via.c.
>>
>> None of the above...
>>
>> I did a bisection, it spotted git-scsi-misc.patch.
>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine.
>>
>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not
>> requeue requests if REQ_FAILFAST is set" is the real culprit. The other
>> commits are touching documentation or drivers I don't use. I'll try
>> to revert only this one this evening.
>>
> Hmm. Weird. I'll have a look into it. Apparently I'll be returning
> an error where I shouldn't. Checking ...
>
Ok, found it. We are blocking even special commands (ie requests with
PREEMPT not set) when FAILFAST is set. Which is clearly wrong.

The attached patch fixes this.

James, please apply.

Cheers,

Hannes
--
Dr. Hannes Reinecke		      zSeries & Storage
[EMAIL PROTECTED]			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

Fix SPI Domain validation

This fixes a thinko of the FAILFAST handling: when we get a request
with FAILFAST set, we still have to evaluate the PREEMPT flag to
decide if this request should be passed through.

Signed-off-by: Hannes Reinecke <[EMAIL PROTECTED]>

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 13e7e09..9ec1566 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1284,13 +1284,15 @@ int scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 			/*
 			 * If the devices is blocked we defer normal commands.
 			 */
-			if (!(req->cmd_flags & REQ_PREEMPT))
-				ret = BLKPREP_DEFER;
-			/*
-			 * Return failfast requests immediately
-			 */
-			if (req->cmd_flags & REQ_FAILFAST)
-				ret = BLKPREP_KILL;
+			if (!(req->cmd_flags & REQ_PREEMPT)) {
+				/*
+				 * Return failfast requests immediately
+				 */
+				if (req->cmd_flags & REQ_FAILFAST)
+					ret = BLKPREP_KILL;
+				else
+					ret = BLKPREP_DEFER;
+			}
 			break;
 		default:
 			/*
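Editor's sketch: the change in the patch above can be modelled outside the kernel to make the before/after behaviour easier to compare. This is a minimal user-space sketch, not kernel code. The `SKETCH_*` names and their values are hypothetical stand-ins for `REQ_FAILFAST`, `REQ_PREEMPT` and the `BLKPREP_*` codes, and the starting value of `ret` (OK) is an assumption about the surrounding function.

```c
/* Hypothetical stand-ins for the kernel's BLKPREP_* codes (values are illustrative). */
enum sketch_prep { SKETCH_PREP_OK, SKETCH_PREP_KILL, SKETCH_PREP_DEFER };

/* Hypothetical flag bits; the real REQ_* values live in the block layer headers. */
#define SKETCH_REQ_FAILFAST (1u << 0)
#define SKETCH_REQ_PREEMPT  (1u << 1)

/* Blocked/quiesced branch before the patch: the FAILFAST test runs
 * unconditionally, so it overrides whatever the PREEMPT test decided. */
static enum sketch_prep blocked_before(unsigned int flags)
{
	enum sketch_prep ret = SKETCH_PREP_OK;	/* assumed initial value */

	if (!(flags & SKETCH_REQ_PREEMPT))
		ret = SKETCH_PREP_DEFER;
	if (flags & SKETCH_REQ_FAILFAST)
		ret = SKETCH_PREP_KILL;
	return ret;
}

/* Same branch after the patch: FAILFAST is only looked at when the
 * request is not a PREEMPT (special) request. */
static enum sketch_prep blocked_after(unsigned int flags)
{
	enum sketch_prep ret = SKETCH_PREP_OK;	/* assumed initial value */

	if (!(flags & SKETCH_REQ_PREEMPT)) {
		if (flags & SKETCH_REQ_FAILFAST)
			ret = SKETCH_PREP_KILL;
		else
			ret = SKETCH_PREP_DEFER;
	}
	return ret;
}
```

In this model a request carrying both flags is killed by `blocked_before()` but passed through by `blocked_after()`, which is the case the patch description talks about. A FAILFAST request without PREEMPT is killed in both versions.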
Re: 2.6.24-rc3-mm1: I/O error, system hangs
On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
> On 23.11.2007 12:38, Hannes Reinecke wrote:
>> Hannes Reinecke wrote:
>>> Laurent Riffard wrote:
>>>> On 21.11.2007 23:41, Andrew Morton wrote:
>>>>> On Wed, 21 Nov 2007 22:45:22 +0100
>>>>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> On 21.11.2007 05:45, Andrew Morton wrote:
>>>>>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>>>>>> Hello,
>>>>>>
>>>>>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
>>>>>> that a bunch of task are blocked in "D" state, they seem to wait for
>>>>>> some I/O completion. I can try to hand-copy some data if requested.
>>>>>>
>>>>>> I found these messages in dmesg:
>>>>>>
>>>>>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1
>>>>>> EXT3-fs: mounted filesystem with ordered data mode.
>>>>>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>>>>>> end_request: I/O error, dev sda, sector 16460
>>>>>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
>>>>>> ReiserFS: sda7: using ordered data mode
>>>>>> --
>>>>>> ReiserFS: sda7: Using r5 hash to sort names
>>>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>>>>>> end_request: I/O error, dev sdb, sector 19632
>>>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>>>>>> end_request: I/O error, dev sdb, sector 40037363
>>>>>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 extents:1 across:1048568k
>>>>>> lp0: using parport0 (interrupt-driven).
>>>>>>
>>>>>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible.
>>>>>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>>>>>>
>>>>>> Maybe something is broken in pata_via driver ?
>>>>>>
>>>>> Could be -
>>>>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
>>>>> and
>>>>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
>>>>> touch pata_via.c.
>>>>
>>>> None of the above...
>>>>
>>>> I did a bisection, it spotted git-scsi-misc.patch.
>>>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine.
>>>>
>>>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not
>>>> requeue requests if REQ_FAILFAST is set" is the real culprit. The other
>>>> commits are touching documentation or drivers I don't use. I'll try
>>>> to revert only this one this evening.
>
> I can confirm: reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
> does fix the problem.
>
>>> Hmm. Weird. I'll have a look into it. Apparently I'll be returning
>>> an error where I shouldn't. Checking ...
>>>
>> Ok, found it. We are blocking even special commands (ie requests with
>> PREEMPT not set) when FAILFAST is set. Which is clearly wrong.
>>
>> The attached patch fixes this.
>
> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O
> errors.

I think the problem is the way we treat BLOCKED and QUIESCED (the latter
is the state that the domain validation uses and which we cannot kill
fastfail on). It's definitely wrong to kill fastfail requests when the
state is QUIESCE.

This patch (which is applied on top of Hannes' original) separates the
BLOCK and QUIESCE states correctly ... does this fix the problem?

James

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 13e7e09..a7cf23a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1279,18 +1279,21 @@ int scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 				    "rejecting I/O to dead device\n");
 			ret = BLKPREP_KILL;
 			break;
-		case SDEV_QUIESCE:
 		case SDEV_BLOCK:
 			/*
-			 * If the devices is blocked we defer normal commands.
-			 */
-			if (!(req->cmd_flags & REQ_PREEMPT))
-				ret = BLKPREP_DEFER;
-			/*
 			 * Return failfast requests immediately
 			 */
 			if (req->cmd_flags & REQ_FAILFAST)
 				ret = BLKPREP_KILL;
+
+			/* fall through */
+
+		case SDEV_QUIESCE:
+			/*
+			 * If the devices is blocked we defer normal commands.
+			 */
+			if (!(req->cmd_flags & REQ_PREEMPT))
+				ret = BLKPREP_DEFER;
 			break;
 		default:
 			/*
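Editor's sketch: to see what the separated BLOCK and QUIESCE cases do for each flag combination, the switch from the diff above can be traced in a small user-space model. This is only a sketch under stated assumptions. The `SKETCH_*` enums and flag bits are hypothetical stand-ins for the kernel's `SDEV_*`, `BLKPREP_*` and `REQ_*` names, the initial `ret` of OK is assumed, and every other device state is collapsed into a single "running" case.

```c
/* Hypothetical stand-ins for the kernel names used in the diff (values are illustrative). */
enum sketch_prep  { SKETCH_PREP_OK, SKETCH_PREP_KILL, SKETCH_PREP_DEFER };
enum sketch_state { SKETCH_SDEV_RUNNING, SKETCH_SDEV_BLOCK, SKETCH_SDEV_QUIESCE };

#define SKETCH_REQ_FAILFAST (1u << 0)
#define SKETCH_REQ_PREEMPT  (1u << 1)

/* Literal transcription of the patched switch: BLOCK tests FAILFAST and
 * then falls through into the QUIESCE test on PREEMPT. */
static enum sketch_prep sketch_state_check(enum sketch_state state,
					   unsigned int flags)
{
	enum sketch_prep ret = SKETCH_PREP_OK;	/* assumed initial value */

	switch (state) {
	case SKETCH_SDEV_BLOCK:
		if (flags & SKETCH_REQ_FAILFAST)
			ret = SKETCH_PREP_KILL;
		/* fall through */
	case SKETCH_SDEV_QUIESCE:
		if (!(flags & SKETCH_REQ_PREEMPT))
			ret = SKETCH_PREP_DEFER;
		break;
	default:
		break;
	}
	return ret;
}
```

In this model a quiesced device never returns KILL: a FAILFAST request without PREEMPT is deferred and one with PREEMPT passes through, which matches the stated goal of not killing failfast requests during domain validation. Tracing the blocked case literally, a FAILFAST request without PREEMPT also ends up deferred, because the fall-through assignment runs after the KILL. The thread does not say whether that was intended.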
Re: 2.6.24-rc3-mm1: I/O error, system hangs
Laurent Riffard wrote:
> On 21.11.2007 23:41, Andrew Morton wrote:
>> On Wed, 21 Nov 2007 22:45:22 +0100
>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
>>
>>> On 21.11.2007 05:45, Andrew Morton wrote:
>>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>>> Hello,
>>>
>>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
>>> that a bunch of task are blocked in "D" state, they seem to wait for
>>> some I/O completion. I can try to hand-copy some data if requested.
>>>
>>> I found these messages in dmesg:
>>>
>>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1
>>> EXT3-fs: mounted filesystem with ordered data mode.
>>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sda, sector 16460
>>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
>>> ReiserFS: sda7: using ordered data mode
>>> --
>>> ReiserFS: sda7: Using r5 hash to sort names
>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sdb, sector 19632
>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sdb, sector 40037363
>>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 extents:1 across:1048568k
>>> lp0: using parport0 (interrupt-driven).
>>>
>>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible.
>>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>>>
>>> Maybe something is broken in pata_via driver ?
>>>
>> Could be -
>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
>> and
>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
>> touch pata_via.c.
>
> None of the above...
>
> I did a bisection, it spotted git-scsi-misc.patch.
> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine.
>
> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not
> requeue requests if REQ_FAILFAST is set" is the real culprit. The other
> commits are touching documentation or drivers I don't use. I'll try
> to revert only this one this evening.
>
Hmm. Weird. I'll have a look into it. Apparently I'll be returning
an error where I shouldn't. Checking ...

Cheers,

Hannes
--
Dr. Hannes Reinecke		      zSeries & Storage
[EMAIL PROTECTED]			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
Re: 2.6.24-rc3-mm1: I/O error, system hangs
On 21.11.2007 23:41, Andrew Morton wrote:
> On Wed, 21 Nov 2007 22:45:22 +0100
> Laurent Riffard <[EMAIL PROTECTED]> wrote:
>
>> On 21.11.2007 05:45, Andrew Morton wrote:
>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>> Hello,
>>
>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
>> that a bunch of task are blocked in "D" state, they seem to wait for
>> some I/O completion. I can try to hand-copy some data if requested.
>>
>> I found these messages in dmesg:
>>
>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1
>> EXT3-fs: mounted filesystem with ordered data mode.
>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>> end_request: I/O error, dev sda, sector 16460
>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
>> ReiserFS: sda7: using ordered data mode
>> --
>> ReiserFS: sda7: Using r5 hash to sort names
>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>> end_request: I/O error, dev sdb, sector 19632
>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>> end_request: I/O error, dev sdb, sector 40037363
>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 extents:1 across:1048568k
>> lp0: using parport0 (interrupt-driven).
>>
>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible.
>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>>
>> Maybe something is broken in pata_via driver ?
>>
>
> Could be -
> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
> and
> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
> touch pata_via.c.

None of the above...

I did a bisection, it spotted git-scsi-misc.patch.
I just ran 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine.

I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not
requeue requests if REQ_FAILFAST is set" is the real culprit. The other
commits are touching documentation or drivers I don't use. I'll try
to revert only this one this evening.

--
laurent
Re: 2.6.24-rc3-mm1: I/O error, system hangs
On Wed, 21 Nov 2007 22:45:22 +0100 Laurent Riffard <[EMAIL PROTECTED]> wrote:

> On 21.11.2007 05:45, Andrew Morton wrote:
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>
> Hello,
>
> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
> that a bunch of task are blocked in "D" state, they seem to wait for
> some I/O completion. I can try to hand-copy some data if requested.
>
> I found these messages in dmesg:
>
> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1
> EXT3-fs: mounted filesystem with ordered data mode.
> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sda, sector 16460
> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
> ReiserFS: sda7: using ordered data mode
> --
> ReiserFS: sda7: Using r5 hash to sort names
> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sdb, sector 19632
> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sdb, sector 40037363
> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap. Priority:-1 extents:1 across:1048568k
> lp0: using parport0 (interrupt-driven).
>
> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible.
> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>
> Maybe something is broken in pata_via driver ?
>

Could be - libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
and pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
touch pata_via.c.
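Editor's sketch: the `hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK` part of the log lines above is a decoded view of a single 32-bit SCSI result word. The helper below shows how such a word splits into those two fields, assuming the conventional midlayer layout of that era (host byte in bits 16-23, driver byte in bits 24-31) and `DID_NO_CONNECT` equal to 0x01. The `sketch_*` and `SKETCH_*` names are hypothetical and are not kernel APIs.

```c
#include <stdint.h>

/* Assumed values, mirroring the kernel's DID_NO_CONNECT and DRIVER_OK. */
#define SKETCH_DID_NO_CONNECT 0x01u
#define SKETCH_DRIVER_OK      0x00u

/* Pack a host byte and a driver byte into a result word
 * (the status and message bytes are left at zero). */
static uint32_t sketch_make_result(unsigned int host, unsigned int driver)
{
	return ((uint32_t)(driver & 0xffu) << 24) |
	       ((uint32_t)(host & 0xffu) << 16);
}

/* Extract the host byte (bits 16-23). */
static unsigned int sketch_host_byte(uint32_t result)
{
	return (result >> 16) & 0xffu;
}

/* Extract the driver byte (bits 24-31). */
static unsigned int sketch_driver_byte(uint32_t result)
{
	return (result >> 24) & 0xffu;
}
```

A host byte of `DID_NO_CONNECT` combined with a driver byte of `DRIVER_OK` suggests the request was rejected in the host-side software path and was not reported as failed by the drive itself. That reading is an interpretation on the editor's part; the messages in this thread do not spell it out.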