Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-28 Thread Laurent Riffard
Le 25.11.2007 21:39, Laurent Riffard a écrit :
> Le 25.11.2007 08:37, James Bottomley a écrit :
>> On Sat, 2007-11-24 at 23:59 +0100, Laurent Riffard wrote:
>>> Le 24.11.2007 14:26, James Bottomley a écrit :
>>>> OK, could you post dmesgs again, please.  I actually tested this with an
>>>> aic79xx card, and for me it does cause Domain Validation to succeed
>>>> again.
>>> James, 
>>>
>>> Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates the
>>> BLOCK and QUIESCE states correctly" (http://lkml.org/lkml/2007/11/24/8).
>>>
[...]
>>> [   25.521256] scsi0 : pata_via
>>> [   25.521711] scsi1 : pata_via
>>> [   25.524089] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xb800 irq 14
>>> [   25.524176] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xb808 irq 15
>>> [   25.683141] ata1.00: ATA-5: ST340016A, 3.75, max UDMA/100
>>> [   25.683208] ata1.00: 78165360 sectors, multi 16: LBA
>>> [   25.683475] ata1.01: ATA-7: Maxtor 6Y080L0, YAR41BW0, max UDMA/133
>>> [   25.684116] ata1.01: 160086528 sectors, multi 16: LBA
>>> [   25.691127] ata1.00: configured for UDMA/100
>>> [   25.699142] ata1.01: configured for UDMA/100
>>> [   26.170860] ata2.00: ATAPI: HL-DT-ST DVDRAM GSA-4165B, DL05, max UDMA/33
>>> [   26.171562] ata2.01: ATAPI: CD-950E/AKU, A4Q, max MWDMA2, CDB intr
>>> [   26.330839] ata2.00: configured for UDMA/33
>>> [   26.490828] ata2.01: configured for MWDMA2
>>> [   26.503014] scsi 0:0:0:0: Direct-Access ATA  ST340016A 3.75 PQ: 0 ANSI: 5
>>> [   26.504670] scsi 0:0:1:0: Direct-Access ATA  Maxtor 6Y080L0 YAR4 PQ: 0 ANSI: 5
>>> [   26.509842] scsi 1:0:0:0: CD-ROM HL-DT-ST DVDRAM GSA-4165B DL05 PQ: 0 ANSI: 5
>>> [   26.511673] scsi 1:0:1:0: CD-ROM E-IDE CD-950E/AKU A4Q PQ: 0 ANSI: 5
>> [...]
>>> [   60.216113] sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>>> [   60.216124] end_request: I/O error, dev sda, sector 16460
>> I think this one's quite easy:  PATA devices in libata are queue depth 1
>> (since they don't do NCQ).  Thus, they're peculiarly sensitive to the
>> bug where we fail over queue depth requests.
>>
>> On the other hand, I don't see how a filesystem request is getting
>> REQ_FAILFAST ... unless there's a bio or readahead issue involved.
>> Anyway, could you try this patch:
>>
>> http://marc.info/?l=linux-scsi&m=119592627425498
>>
>> Which should fix the queue depth issue, and see if the errors go away?
> 
> No, this one doesn't help...
 
still happens with 2.6.24-rc3-mm2...
-- 
laurent
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-25 Thread Hannes Reinecke
On Sat, Nov 24, 2007 at 07:44:13PM +0200, James Bottomley wrote:
> Probing intermittent failures in Domain Validation, even with the fixes
> applied leads me to the conclusion that there are further problems with
> this commit:
> 
> commit fc5eb4facedbd6d7117905e775cee1975f894e79
> Author: Hannes Reinecke <[EMAIL PROTECTED]>
> Date:   Tue Nov 6 09:23:40 2007 +0100
> 
> [SCSI] Do not requeue requests if REQ_FAILFAST is set
>  
> The essence of the problems is that you're causing REQ_FAILFAST to
> terminate commands with error on requeuing conditions, some of which are
> relatively common on most SCSI devices.  While this may be the correct
> behaviour for multi-path, it's certainly wrong for the previously
> understood meaning of REQ_FAILFAST, which was don't retry on error,
> which is why domain validation and other applications use it to control
> error handling, but don't expect to get failures for a simple requeue
> are now spitting errors.
> 
> I honestly can't see that, even for the multi-path case, returning an
> error when we're over queue depth is the correct thing to do (it may not
> matter to something like a symmetrix, but an array that has a non-zero
> cost associated with a path change, like a CPQ HSV or the AVT
> controllers, will show fairly large slow downs if you do this).  Even if
> this is the desired behaviour (and I think that's a policy issue),
> DID_NO_CONNECT is almost certainly the wrong error to be sending back.
> 
> This patch fixes up domain validation to work again correctly, however,
> I really think it's just a bandaid.  Do you want to rethink the above
> commit?
> 
Given the number of errors, yes, I'll have to.
But we still face the initial problem that requeued requests will be
stuck in the queue forever (i.e. until the timeout catches them), causing
failover to be painfully slow.

Anyway, I'll think it over.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
[EMAIL PROTECTED] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-25 Thread Laurent Riffard
Le 25.11.2007 08:37, James Bottomley a écrit :
> On Sat, 2007-11-24 at 23:59 +0100, Laurent Riffard wrote:
>> Le 24.11.2007 14:26, James Bottomley a écrit :
>>> OK, could you post dmesgs again, please.  I actually tested this with an
>>> aic79xx card, and for me it does cause Domain Validation to succeed
>>> again.
>> James, 
>>
>> Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates the
>> BLOCK and QUIESCE states correctly" (http://lkml.org/lkml/2007/11/24/8).
>>
>> How to reproduce:
>> - boot
>> - switch to a text console
>> - capture dmesg in a file, sync, etc. There are 3 I/O errors, but the 
>>   system does work.
>> - switch to the X console and log in to the GNOME desktop; the system
>>   partially hangs.
>> - switch back to a text console: dmesg(1) still works and shows some 
>>   additional I/O errors. At this point, any disk access makes the system 
>>   hang completely.
>>
>> Additional data:
>> - the I/O errors always happen on the same blocks.
>>
>> plain text document attachment (dmesg-2.6.24-rc3-mm1-patched)
> [...]
>> [   25.521256] scsi0 : pata_via
>> [   25.521711] scsi1 : pata_via
>> [   25.524089] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xb800 irq 14
>> [   25.524176] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xb808 irq 15
>> [   25.683141] ata1.00: ATA-5: ST340016A, 3.75, max UDMA/100
>> [   25.683208] ata1.00: 78165360 sectors, multi 16: LBA
>> [   25.683475] ata1.01: ATA-7: Maxtor 6Y080L0, YAR41BW0, max UDMA/133
>> [   25.684116] ata1.01: 160086528 sectors, multi 16: LBA
>> [   25.691127] ata1.00: configured for UDMA/100
>> [   25.699142] ata1.01: configured for UDMA/100
>> [   26.170860] ata2.00: ATAPI: HL-DT-ST DVDRAM GSA-4165B, DL05, max UDMA/33
>> [   26.171562] ata2.01: ATAPI: CD-950E/AKU, A4Q, max MWDMA2, CDB intr
>> [   26.330839] ata2.00: configured for UDMA/33
>> [   26.490828] ata2.01: configured for MWDMA2
>> [   26.503014] scsi 0:0:0:0: Direct-Access ATA  ST340016A 3.75 PQ: 0 ANSI: 5
>> [   26.504670] scsi 0:0:1:0: Direct-Access ATA  Maxtor 6Y080L0 YAR4 PQ: 0 ANSI: 5
>> [   26.509842] scsi 1:0:0:0: CD-ROM HL-DT-ST DVDRAM GSA-4165B DL05 PQ: 0 ANSI: 5
>> [   26.511673] scsi 1:0:1:0: CD-ROM E-IDE CD-950E/AKU A4Q PQ: 0 ANSI: 5
> [...]
>> [   60.216113] sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>> [   60.216124] end_request: I/O error, dev sda, sector 16460
> 
> I think this one's quite easy:  PATA devices in libata are queue depth 1
> (since they don't do NCQ).  Thus, they're peculiarly sensitive to the
> bug where we fail over queue depth requests.
> 
> On the other hand, I don't see how a filesystem request is getting
> REQ_FAILFAST ... unless there's a bio or readahead issue involved.
> Anyway, could you try this patch:
> 
> http://marc.info/?l=linux-scsi&m=119592627425498
> 
> Which should fix the queue depth issue, and see if the errors go away?

No, this one doesn't help...

-- 
laurent


Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread James Bottomley
On Sat, 2007-11-24 at 23:59 +0100, Laurent Riffard wrote:
> Le 24.11.2007 14:26, James Bottomley a écrit :
> > OK, could you post dmesgs again, please.  I actually tested this with an
> > aic79xx card, and for me it does cause Domain Validation to succeed
> > again.
> 
> James, 
> 
> Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates the
> BLOCK and QUIESCE states correctly" (http://lkml.org/lkml/2007/11/24/8).
> 
> How to reproduce:
> - boot
> - switch to a text console
> - capture dmesg in a file, sync, etc. There are 3 I/O errors, but the 
>   system does work.
> - switch to the X console and log in to the GNOME desktop; the system
>   partially hangs.
> - switch back to a text console: dmesg(1) still works and shows some 
>   additional I/O errors. At this point, any disk access makes the system 
>   hang completely.
> 
> Additional data:
> - the I/O errors always happen on the same blocks.
> 
> plain text document attachment (dmesg-2.6.24-rc3-mm1-patched)
[...]
> [   25.521256] scsi0 : pata_via
> [   25.521711] scsi1 : pata_via
> [   25.524089] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xb800 irq 14
> [   25.524176] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xb808 irq 15
> [   25.683141] ata1.00: ATA-5: ST340016A, 3.75, max UDMA/100
> [   25.683208] ata1.00: 78165360 sectors, multi 16: LBA
> [   25.683475] ata1.01: ATA-7: Maxtor 6Y080L0, YAR41BW0, max UDMA/133
> [   25.684116] ata1.01: 160086528 sectors, multi 16: LBA
> [   25.691127] ata1.00: configured for UDMA/100
> [   25.699142] ata1.01: configured for UDMA/100
> [   26.170860] ata2.00: ATAPI: HL-DT-ST DVDRAM GSA-4165B, DL05, max UDMA/33
> [   26.171562] ata2.01: ATAPI: CD-950E/AKU, A4Q, max MWDMA2, CDB intr
> [   26.330839] ata2.00: configured for UDMA/33
> [   26.490828] ata2.01: configured for MWDMA2
> [   26.503014] scsi 0:0:0:0: Direct-Access ATA  ST340016A 3.75 PQ: 0 ANSI: 5
> [   26.504670] scsi 0:0:1:0: Direct-Access ATA  Maxtor 6Y080L0 YAR4 PQ: 0 ANSI: 5
> [   26.509842] scsi 1:0:0:0: CD-ROM HL-DT-ST DVDRAM GSA-4165B DL05 PQ: 0 ANSI: 5
> [   26.511673] scsi 1:0:1:0: CD-ROM E-IDE CD-950E/AKU A4Q PQ: 0 ANSI: 5
[...]
> [   60.216113] sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> [   60.216124] end_request: I/O error, dev sda, sector 16460

I think this one's quite easy:  PATA devices in libata are queue depth 1
(since they don't do NCQ).  Thus, they're peculiarly sensitive to the
bug where we fail over queue depth requests.
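
To illustrate why a queue depth of 1 makes the problem so visible, here is a toy model. It is only a sketch, not kernel code: `ToyDevice`, `Request`, `submit` and the `fail_failfast_when_busy` switch are hypothetical names invented for this illustration, and the model assumes that a request arriving while the device is already at its queue depth is either requeued or, under the behaviour discussed in this thread, failed if it is marked failfast.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    failfast: bool = False


class ToyDevice:
    """Toy device that accepts at most `queue_depth` commands at once."""

    def __init__(self, queue_depth, fail_failfast_when_busy):
        self.queue_depth = queue_depth
        # Hypothetical switch: True models the behaviour described as a bug.
        self.fail_failfast_when_busy = fail_failfast_when_busy
        self.in_flight = 0

    def submit(self, req):
        if self.in_flight < self.queue_depth:
            self.in_flight += 1
            return "dispatched"
        # The device is at its queue depth: it is busy, not broken.
        if req.failfast and self.fail_failfast_when_busy:
            return "failed"
        return "requeued"


def run(queue_depth, buggy):
    """Submit two failfast requests back to back and report what happens."""
    dev = ToyDevice(queue_depth, buggy)
    reqs = [Request("r0", failfast=True), Request("r1", failfast=True)]
    return [dev.submit(r) for r in reqs]


if __name__ == "__main__":
    print("depth 1, failing when busy: ", run(1, True))
    print("depth 1, requeue when busy: ", run(1, False))
    print("depth 32, failing when busy:", run(32, True))
```

With a depth of 1 the second request reaches the limit immediately, so the outcome depends entirely on the busy-path policy. With a larger depth (32 is an arbitrary choice here) the same two requests never reach that path.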

On the other hand, I don't see how a filesystem request is getting
REQ_FAILFAST ... unless there's a bio or readahead issue involved.
Anyway, could you try this patch:

http://marc.info/?l=linux-scsi&m=119592627425498

Which should fix the queue depth issue, and see if the errors go away?

Thanks,

James




Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Laurent Riffard
Le 24.11.2007 14:26, James Bottomley a écrit :
> On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
>> Le 24.11.2007 07:42, James Bottomley a écrit :
>>> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
>>>> Le 23.11.2007 12:38, Hannes Reinecke a écrit :
[snip]
>>>>>>>> I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 
>>>>>>>> does fix the problem.
>>>>>>>>
>>>>>> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an 
>>>>>> error where
>>>>>> I shouldn't. Checking ...
>>>>>>
>>>>> Ok, found it. We are blocking even special commands (ie requests with 
>>>>> PREEMPT not set)
>>>>> when FAILFAST is set. Which is clearly wrong. The attached patch fixes 
>>>>> this.
>>>> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O 
>>>> errors.
>>> I think the problem is the way we treat BLOCKED and QUIESCED (the latter
>>> is the state that the domain validation uses and which we cannot kill
>>> fastfail on).  It's definitely wrong to kill fastfail requests when the
>>> state is QUIESCE.
>>>
>>> This patch (which is applied on top of Hannes original) separates the
>>> BLOCK and QUIESCE states correctly ... does this fix the problem?
>>
>> No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
> 
> OK, could you post dmesgs again, please.  I actually tested this with an
> aic79xx card, and for me it does cause Domain Validation to succeed
> again.

James, 

Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates the 
BLOCK and QUIESCE states correctly" (http://lkml.org/lkml/2007/11/24/8).

How to reproduce:
- boot
- switch to a text console
- capture dmesg in a file, sync, etc. There are 3 I/O errors, but the 
  system does work.
- switch to the X console and log in to the GNOME desktop; the system
  partially hangs.
- switch back to a text console: dmesg(1) still works and shows some 
  additional I/O errors. At this point, any disk access makes the system 
  hang completely.

Additional data:
- the I/O errors always happen on the same blocks.
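
One way to check that the errors recur on the same blocks is to tally the `end_request` lines from several saved dmesg captures. The following is a minimal sketch, not a tool used in this thread: the function name `tally_io_errors` and the `SAMPLE` list are hypothetical, the sample lines reuse the error messages reported earlier in the thread, and repeating them twice merely simulates two boots.

```python
import re
from collections import Counter

# Matches lines such as:
#   end_request: I/O error, dev sda, sector 16460
END_REQUEST = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def tally_io_errors(lines):
    """Count I/O errors per (device, sector) pair across dmesg lines."""
    counts = Counter()
    for line in lines:
        match = END_REQUEST.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Hypothetical input: the three errors from the thread, seen in two captures.
SAMPLE = [
    "[   60.216124] end_request: I/O error, dev sda, sector 16460",
    "end_request: I/O error, dev sdb, sector 19632",
    "end_request: I/O error, dev sdb, sector 40037363",
    "ReiserFS: sda7: using ordered data mode",
] * 2

if __name__ == "__main__":
    for (dev, sector), count in sorted(tally_io_errors(SAMPLE).items()):
        print(f"{dev} sector {sector}: {count} error(s)")
```

A (device, sector) pair that shows up in every capture supports the observation that the failures are tied to specific requests rather than occurring at random.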

-- 
laurent
[0.00] Linux version 2.6.24-rc3-mm1 ([EMAIL PROTECTED]) (gcc version 
4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)) #122 PREEMPT Fri Nov 23 
18:47:58 CET 2007
[0.00] BIOS-provided physical RAM map:
[0.00]  BIOS-e820:  - 0009fc00 (usable)
[0.00]  BIOS-e820: 0009fc00 - 000a (reserved)
[0.00]  BIOS-e820: 000f - 0010 (reserved)
[0.00]  BIOS-e820: 0010 - 1ffec000 (usable)
[0.00]  BIOS-e820: 1ffec000 - 1ffef000 (ACPI data)
[0.00]  BIOS-e820: 1ffef000 - 1000 (reserved)
[0.00]  BIOS-e820: 1000 - 2000 (ACPI NVS)
[0.00]  BIOS-e820:  - 0001 (reserved)
[0.00] 511MB LOWMEM available.
[0.00] Entering add_active_range(0, 0, 131052) 0 entries of 256 used
[0.00] sizeof(struct page) = 32
[0.00] Zone PFN ranges:
[0.00]   DMA 0 -> 4096
[0.00]   Normal   4096 ->   131052
[0.00] Movable zone start PFN for each node
[0.00] early_node_map[1] active PFN ranges
[0.00] 0:0 ->   131052
[0.00] On node 0 totalpages: 131052
[0.00] Node 0 memmap at 0xC100 size 4194304 first pfn 0xC100
[0.00]   DMA zone: 32 pages used for memmap
[0.00]   DMA zone: 0 pages reserved
[0.00]   DMA zone: 4064 pages, LIFO batch:0
[0.00]   Normal zone: 991 pages used for memmap
[0.00]   Normal zone: 125965 pages, LIFO batch:31
[0.00]   Movable zone: 0 pages used for memmap
[0.00] DMI 2.3 present.
[0.00] ACPI: RSDP 000F6A80, 0014 (r0 ASUS  )
[0.00] ACPI: RSDT 1FFEC000, 002C (r1 ASUS   A7V133-C 30303031 MSFT 
31313031)
[0.00] ACPI: FACP 1FFEC080, 0074 (r1 ASUS   A7V133-C 30303031 MSFT 
31313031)
[0.00] ACPI: DSDT 1FFEC100, 2CE1 (r1   ASUS A7V133-C 1000 MSFT  
10B)
[0.00] ACPI: FACS 1000, 0040
[0.00] ACPI: BOOT 1FFEC040, 0028 (r1 ASUS   A7V133-C 30303031 MSFT 
31313031)
[0.00] ACPI: PM-Timer IO Port: 0xe408
[0.00] Allocating PCI resources starting at 3000 (gap: 
2000:dfff)
[0.00] swsusp: Registered nosave memory region: 0009f000 - 
000a
[0.00] swsusp: Registered nosave memory region: 000a - 
000f
[0.00] swsusp: Registered nosave memory region: 000f - 
0010
[0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total 
pages: 130029
[0.00] Kernel command line: root=/dev/mapper/vglinux1-lv_ubuntu2 ro 
locale=fr_FR video=radeonfb:[EMAIL PROTECTED] resume=/dev/mapper/vglinux1-lvswap
[0.00] Local APIC disabled by BIOS -- you can enable it with "lapic"
[0.00] mapped APIC to b000 

Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Gabriel C
Gabriel C wrote:
> James Bottomley wrote:
>> On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote:
>>> James Bottomley wrote:
 On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
> Le 24.11.2007 07:42, James Bottomley a écrit :
>> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
>>> Le 23.11.2007 12:38, Hannes Reinecke a écrit :
 Hannes Reinecke wrote:
> Laurent Riffard wrote:
>> Le 21.11.2007 23:41, Andrew Morton a écrit :
>>> On Wed, 21 Nov 2007 22:45:22 +0100
>>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
>>>
 Le 21.11.2007 05:45, Andrew Morton a écrit :
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
 Hello, 

 My system hangs shortly after I logged in Gnome desktop. SysRq-W 
 shows
 that a bunch of task are blocked in "D" state, they seem to wait 
 for
 some I/O completion. I can try to hand-copy some data if requested.

 I found these messages in dmesg:

 ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
 EXT3-fs: mounted filesystem with ordered data mode.
 sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
 driverbyte=DRIVER_OK,SUGGEST_OK
 end_request: I/O error, dev sda, sector 16460
 ReiserFS: sda7: found reiserfs format "3.6" with standard journal
 ReiserFS: sda7: using ordered data mode
 --
 ReiserFS: sda7: Using r5 hash to sort names
 sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
 driverbyte=DRIVER_OK,SUGGEST_OK
 end_request: I/O error, dev sdb, sector 19632
 sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
 driverbyte=DRIVER_OK,SUGGEST_OK
 end_request: I/O error, dev sdb, sector 40037363
 Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 
 extents:1 across:1048568k
 lp0: using parport0 (interrupt-driven).

 These errors occur *only* with 2.6.24-rc3-mm1, they are 100% 
 reproducible.
 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.

 Maybe something is broken in pata_via driver ?

>>> Could be - 
>>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
>>> and 
>>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
>>> touch pata_via.c.
>> None of the above...
>>
>> I did a bisection, it spotted git-scsi-misc.patch. 
>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works 
>> fine.
>>
>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do 
>> not 
>> requeue requests if REQ_FAILFAST is set" is the real culprit. The 
>> other 
>> commits are touching documentation or drivers I don't use. I'll try 
>> to revert only this one this evening.
>>> I can confirm : reverting commit 
>>> 8655a546c83fc43f0a73416bbd126d02de7ad6c0 
>>> does fix the problem.
>>>
> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an 
> error where
> I shouldn't. Checking ...
>
 Ok, found it. We are blocking even special commands (ie requests with 
 PREEMPT not set)
 when FAILFAST is set. Which is clearly wrong. The attached patch fixes 
 this.
>>> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with 
>>> I/O errors.
>> I think the problem is the way we treat BLOCKED and QUIESCED (the latter
>> is the state that the domain validation uses and which we cannot kill
>> fastfail on).  It's definitely wrong to kill fastfail requests when the
>> state is QUIESCE.
>>
>> This patch (which is applied on top of Hannes original) separates the
>> BLOCK and QUIESCE states correctly ... does this fix the problem?
> No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
 OK, could you post dmesgs again, please.  I actually tested this with an
 aic79xx card, and for me it does cause Domain Validation to succeed
 again.

>>> Are the patches indeed to fix that problem as well ? 
>>>
>>> http://lkml.org/lkml/2007/11/23/5
>> That dmesg is from an unknown SCSI card exhibiting Domain Validation
>> problems, so it's a reasonable probability, yes ... but you'll need the
>> additional hack I just did to prevent further intermittent failures.
> 
> My controller is:
> 
> 03:0e.0 SCSI storage controller [0100]: Adaptec AIC-7892P U160/m [9005:008f] 
> (rev 02)
> 
> I'll try the patches in a bit.

With your patches my problem(s) are solved. Domain Validation works again.
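
For readers following the thread, the state handling discussed in the quoted messages above can be summarised as a small decision function. This is a hedged sketch of the intent as described in the thread, not the actual kernel code or the actual patch: `State`, `prep_decision` and the returned strings are hypothetical names, and the treatment of the BLOCK state (kill failfast requests unless they are special commands, defer everything else) is an assumption drawn from the discussion.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    QUIESCE = "quiesce"   # state used by Domain Validation, per the thread
    BLOCK = "block"


def prep_decision(state, failfast, preempt):
    """Return "dispatch", "defer" or "kill" for a request.

    failfast: the request carries REQ_FAILFAST.
    preempt:  the request is a special (preempt) command.
    """
    if state is State.RUNNING:
        return "dispatch"
    if state is State.QUIESCE:
        # Never kill here: special commands pass, everything else waits.
        return "dispatch" if preempt else "defer"
    # BLOCK (assumption): only ordinary failfast requests are terminated.
    if failfast and not preempt:
        return "kill"
    return "defer"


if __name__ == "__main__":
    for state in State:
        for failfast in (False, True):
            for preempt in (False, True):
                result = prep_decision(state, failfast, preempt)
                print(state.name, failfast, preempt, "->", result)
```

The point of keeping the two states apart is that a failfast request issued during QUIESCE is deferred or dispatched but never failed, which matches the statement that it is wrong to kill fastfail requests in that state.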

...

[   32.179521] scsi 

Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Gabriel C
James Bottomley wrote:
> On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote:
>> James Bottomley wrote:
>>> On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
 Le 24.11.2007 07:42, James Bottomley a écrit :
> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
>> Le 23.11.2007 12:38, Hannes Reinecke a écrit :
>>> Hannes Reinecke wrote:
 Laurent Riffard wrote:
> Le 21.11.2007 23:41, Andrew Morton a écrit :
>> On Wed, 21 Nov 2007 22:45:22 +0100
>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
>>
>>> Le 21.11.2007 05:45, Andrew Morton a écrit :
 ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>>> Hello, 
>>>
>>> My system hangs shortly after I logged in Gnome desktop. SysRq-W 
>>> shows
>>> that a bunch of task are blocked in "D" state, they seem to wait for
>>> some I/O completion. I can try to hand-copy some data if requested.
>>>
>>> I found these messages in dmesg:
>>>
>>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
>>> EXT3-fs: mounted filesystem with ordered data mode.
>>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sda, sector 16460
>>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
>>> ReiserFS: sda7: using ordered data mode
>>> --
>>> ReiserFS: sda7: Using r5 hash to sort names
>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sdb, sector 19632
>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sdb, sector 40037363
>>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 
>>> extents:1 across:1048568k
>>> lp0: using parport0 (interrupt-driven).
>>>
>>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% 
>>> reproducible.
>>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>>>
>>> Maybe something is broken in pata_via driver ?
>>>
>> Could be - 
>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
>> and 
>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
>> touch pata_via.c.
> None of the above...
>
> I did a bisection, it spotted git-scsi-misc.patch. 
> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works 
> fine.
>
> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do 
> not 
> requeue requests if REQ_FAILFAST is set" is the real culprit. The 
> other 
> commits are touching documentation or drivers I don't use. I'll try 
> to revert only this one this evening.
>> I can confirm : reverting commit 
>> 8655a546c83fc43f0a73416bbd126d02de7ad6c0 
>> does fix the problem.
>>
 Hmm. Weird. I'll have a look into it. Apparently I'll be returning an 
 error where
 I shouldn't. Checking ...

>>> Ok, found it. We are blocking even special commands (ie requests with 
>>> PREEMPT not set)
>>> when FAILFAST is set. Which is clearly wrong. The attached patch fixes 
>>> this.
>> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O 
>> errors.
> I think the problem is the way we treat BLOCKED and QUIESCED (the latter
> is the state that the domain validation uses and which we cannot kill
> fastfail on).  It's definitely wrong to kill fastfail requests when the
> state is QUIESCE.
>
> This patch (which is applied on top of Hannes original) separates the
> BLOCK and QUIESCE states correctly ... does this fix the problem?
 No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
>>> OK, could you post dmesgs again, please.  I actually tested this with an
>>> aic79xx card, and for me it does cause Domain Validation to succeed
>>> again.
>>>
>> Are the patches indeed to fix that problem as well ? 
>>
>> http://lkml.org/lkml/2007/11/23/5
> 
> That dmesg is from an unknown SCSI card exhibiting Domain Validation
> problems, so it's a reasonable probability, yes ... but you'll need the
> additional hack I just did to prevent further intermittent failures.

My controller is:

03:0e.0 SCSI storage controller [0100]: Adaptec AIC-7892P U160/m [9005:008f] 
(rev 02)

I'll try the patches in a bit.

> 
> James
> 

Gabriel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  

Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread James Bottomley

On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote:
> James Bottomley wrote:
> > On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
> >> Le 24.11.2007 07:42, James Bottomley a écrit :
> >>> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
>  Le 23.11.2007 12:38, Hannes Reinecke a écrit :
> > Hannes Reinecke wrote:
> >> Laurent Riffard wrote:
> >>> Le 21.11.2007 23:41, Andrew Morton a écrit :
>  On Wed, 21 Nov 2007 22:45:22 +0100
>  Laurent Riffard <[EMAIL PROTECTED]> wrote:
> 
> > Le 21.11.2007 05:45, Andrew Morton a écrit :
> >> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> > Hello, 
> >
> > My system hangs shortly after I logged in Gnome desktop. SysRq-W 
> > shows
> > that a bunch of task are blocked in "D" state, they seem to wait for
> > some I/O completion. I can try to hand-copy some data if requested.
> >
> > I found these messages in dmesg:
> >
> > ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
> > EXT3-fs: mounted filesystem with ordered data mode.
> > sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
> > driverbyte=DRIVER_OK,SUGGEST_OK
> > end_request: I/O error, dev sda, sector 16460
> > ReiserFS: sda7: found reiserfs format "3.6" with standard journal
> > ReiserFS: sda7: using ordered data mode
> > --
> > ReiserFS: sda7: Using r5 hash to sort names
> > sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
> > driverbyte=DRIVER_OK,SUGGEST_OK
> > end_request: I/O error, dev sdb, sector 19632
> > sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
> > driverbyte=DRIVER_OK,SUGGEST_OK
> > end_request: I/O error, dev sdb, sector 40037363
> > Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 
> > extents:1 across:1048568k
> > lp0: using parport0 (interrupt-driven).
> >
> > These errors occur *only* with 2.6.24-rc3-mm1, they are 100% 
> > reproducible.
> > 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
> >
> > Maybe something is broken in pata_via driver ?
> >
>  Could be - 
>  libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
>  and 
>  pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
>  touch pata_via.c.
> >>> None of the above...
> >>>
> >>> I did a bisection, it spotted git-scsi-misc.patch. 
> >>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works 
> >>> fine.
> >>>
> >>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do 
> >>> not 
> >>> requeue requests if REQ_FAILFAST is set" is the real culprit. The 
> >>> other 
> >>> commits are touching documentation or drivers I don't use. I'll try 
> >>> to revert only this one this evening.
>  I can confirm : reverting commit 
>  8655a546c83fc43f0a73416bbd126d02de7ad6c0 
>  does fix the problem.
> 
> >> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an 
> >> error where
> >> I shouldn't. Checking ...
> >>
> > Ok, found it. We are blocking even special commands (ie requests with 
> > PREEMPT not set)
> > when FAILFAST is set. Which is clearly wrong. The attached patch fixes 
> > this.
>  Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O 
>  errors.
> >>> I think the problem is the way we treat BLOCKED and QUIESCED (the latter
> >>> is the state that the domain validation uses and which we cannot kill
> >>> fastfail on).  It's definitely wrong to kill fastfail requests when the
> >>> state is QUIESCE.
> >>>
> >>> This patch (which is applied on top of Hannes original) separates the
> >>> BLOCK and QUIESCE states correctly ... does this fix the problem?
> >>
> >> No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
> > 
> > OK, could you post dmesgs again, please.  I actually tested this with an
> > aic79xx card, and for me it does cause Domain Validation to succeed
> > again.
> > 
> 
> Are the patches indeed to fix that problem as well ? 
> 
> http://lkml.org/lkml/2007/11/23/5

That dmesg is from an unknown SCSI card exhibiting Domain Validation
problems, so it's a reasonable probability, yes ... but you'll need the
additional hack I just did to prevent further intermittent failures.

James




Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Gabriel C
James Bottomley wrote:
> On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
>> Le 24.11.2007 07:42, James Bottomley a écrit :
>>> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
 Le 23.11.2007 12:38, Hannes Reinecke a écrit :
> Hannes Reinecke wrote:
>> Laurent Riffard wrote:
>>> Le 21.11.2007 23:41, Andrew Morton a écrit :
 On Wed, 21 Nov 2007 22:45:22 +0100
 Laurent Riffard <[EMAIL PROTECTED]> wrote:

> Le 21.11.2007 05:45, Andrew Morton a écrit :
>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> Hello, 
>
> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
> that a bunch of task are blocked in "D" state, they seem to wait for
> some I/O completion. I can try to hand-copy some data if requested.
>
> I found these messages in dmesg:
>
> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
> EXT3-fs: mounted filesystem with ordered data mode.
> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
> driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sda, sector 16460
> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
> ReiserFS: sda7: using ordered data mode
> --
> ReiserFS: sda7: Using r5 hash to sort names
> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
> driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sdb, sector 19632
> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
> driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sdb, sector 40037363
> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 
> extents:1 across:1048568k
> lp0: using parport0 (interrupt-driven).
>
> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% 
> reproducible.
> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>
> Maybe something is broken in pata_via driver ?
>
 Could be - 
 libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
 and 
 pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
 touch pata_via.c.
>>> None of the above...
>>>
>>> I did a bisection, it spotted git-scsi-misc.patch. 
>>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works 
>>> fine.
>>>
>>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not 
>>> requeue requests if REQ_FAILFAST is set" is the real culprit. The other 
>>> commits are touching documentation or drivers I don't use. I'll try 
>>> to revert only this one this evening.
 I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 
 does fix the problem.

>> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an 
>> error where
>> I shouldn't. Checking ...
>>
> Ok, found it. We are blocking even special commands (ie requests with 
> PREEMPT not set)
> when FAILFAST is set. Which is clearly wrong. The attached patch fixes 
> this.
 Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O 
 errors.
>>> I think the problem is the way we treat BLOCKED and QUIESCED (the latter
>>> is the state that the domain validation uses and which we cannot kill
>>> fastfail on).  It's definitely wrong to kill fastfail requests when the
>>> state is QUIESCE.
>>>
>>> This patch (which is applied on top of Hannes original) separates the
>>> BLOCK and QUIESCE states correctly ... does this fix the problem?
>>
>> No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
> 
> OK, could you post dmesgs again, please.  I actually tested this with an
> aic79xx card, and for me it does cause Domain Validation to succeed
> again.
> 

Are the patches indeed to fix that problem as well ? 

http://lkml.org/lkml/2007/11/23/5

> James

Gabriel 



Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread James Bottomley
Probing intermittent failures in Domain Validation, even with the fixes
applied leads me to the conclusion that there are further problems with
this commit:

commit fc5eb4facedbd6d7117905e775cee1975f894e79
Author: Hannes Reinecke <[EMAIL PROTECTED]>
Date:   Tue Nov 6 09:23:40 2007 +0100

[SCSI] Do not requeue requests if REQ_FAILFAST is set
 
The essence of the problem is that you're causing REQ_FAILFAST to
terminate commands with an error on requeuing conditions, some of which
are relatively common on most SCSI devices.  While this may be the
correct behaviour for multi-path, it's certainly wrong for the
previously understood meaning of REQ_FAILFAST, which was "don't retry on
error".  Domain validation and other applications use it to control
error handling; they don't expect failures for a simple requeue, and
are now spitting errors.

I honestly can't see that, even for the multi-path case, returning an
error when we're over queue depth is the correct thing to do (it may not
matter to something like a symmetrix, but an array that has a non-zero
cost associated with a path change, like a CPQ HSV or the AVT
controllers, will show fairly large slow downs if you do this).  Even if
this is the desired behaviour (and I think that's a policy issue),
DID_NO_CONNECT is almost certainly the wrong error to be sending back.
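
As a reading aid for the hostbyte=DID_NO_CONNECT lines in the dmesg
output quoted in this thread, here is a minimal sketch of how a SCSI
result word splits into its four bytes.  The shift and the DID_* values
are believed to match the kernel's include/scsi/scsi.h; make_result() is
a hypothetical helper introduced only for illustration and is not a
kernel function.

```c
#include <stdint.h>

/* A few host-byte codes, believed to match include/scsi/scsi.h. */
enum {
	DID_OK         = 0x00,
	DID_NO_CONNECT = 0x01,
	DID_BUS_BUSY   = 0x02
};

/*
 * The mid-layer packs four status bytes into one 32-bit result:
 * driver byte (bits 31..24), host byte (23..16),
 * message byte (15..8) and status byte (7..0).
 */
static inline unsigned int host_byte(uint32_t result)
{
	return (result >> 16) & 0xff;
}

/* Hypothetical helper: assemble a result word from its four bytes. */
static inline uint32_t make_result(unsigned int driver, unsigned int host,
				   unsigned int msg, unsigned int status)
{
	return ((uint32_t)(driver & 0xff) << 24) |
	       ((uint32_t)(host & 0xff) << 16) |
	       ((uint32_t)(msg & 0xff) << 8) |
	       (uint32_t)(status & 0xff);
}
```

A command completed with a result of DID_NO_CONNECT << 16 therefore
decodes to hostbyte=DID_NO_CONNECT with all other bytes zero, which is
consistent with the log lines reported above.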

This patch fixes up domain validation to work again correctly, however,
I really think it's just a bandaid.  Do you want to rethink the above
commit?

James

Index: BUILD-2.6/drivers/scsi/scsi_lib.c
===================================================================
--- BUILD-2.6.orig/drivers/scsi/scsi_lib.c  2007-11-24 11:25:20.0 -0600
+++ BUILD-2.6/drivers/scsi/scsi_lib.c   2007-11-24 11:26:22.0 -0600
@@ -1552,7 +1552,8 @@ static void scsi_request_fn(struct reque
 			break;
 
 		if (!scsi_dev_queue_ready(q, sdev)) {
-			if (req->cmd_flags & REQ_FAILFAST) {
+			if ((req->cmd_flags & REQ_FAILFAST) &&
+			    !(req->cmd_flags & REQ_PREEMPT)) {
 				scsi_kill_request(req, q);
 				continue;
 			}
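
The decision this hunk changes can be sketched as a small stand-alone
function.  This is only an illustration under stated assumptions: the
flag values below are made up (the real REQ_* bits live in
include/linux/blkdev.h), and on_queue_check() and the queue_action enum
are hypothetical names, not kernel symbols.

```c
#include <stdbool.h>

/* Illustrative flag bits only; the kernel's real values differ. */
#define REQ_FAILFAST (1u << 0)
#define REQ_PREEMPT  (1u << 1)

/* Hypothetical outcome of looking at one request in the request loop. */
enum queue_action {
	ACTION_DISPATCH,     /* device can take the command now */
	ACTION_LEAVE_QUEUED, /* device busy: leave request on the queue */
	ACTION_KILL          /* device busy: complete request with an error */
};

/*
 * Model of the patched check: when the device queue is not ready, only
 * failfast requests that are NOT preempt requests are killed.
 * Everything else stays queued and is retried later.
 */
static enum queue_action on_queue_check(unsigned int cmd_flags,
					bool queue_ready)
{
	if (queue_ready)
		return ACTION_DISPATCH;

	if ((cmd_flags & REQ_FAILFAST) && !(cmd_flags & REQ_PREEMPT))
		return ACTION_KILL;

	return ACTION_LEAVE_QUEUED;
}
```

With this predicate, a request carrying both REQ_FAILFAST and
REQ_PREEMPT (the kind the thread says domain validation issues) is left
queued when the device is busy instead of being failed, while a plain
failfast request is still killed.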




Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread James Bottomley
On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
> Le 24.11.2007 07:42, James Bottomley a écrit :
> > On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
> >> Le 23.11.2007 12:38, Hannes Reinecke a écrit :
> >>> Hannes Reinecke wrote:
>  Laurent Riffard wrote:
> > Le 21.11.2007 23:41, Andrew Morton a écrit :
> >> On Wed, 21 Nov 2007 22:45:22 +0100
> >> Laurent Riffard <[EMAIL PROTECTED]> wrote:
> >>
> >>> Le 21.11.2007 05:45, Andrew Morton a écrit :
>  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> >>> Hello, 
> >>>
> >>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
> >>> that a bunch of task are blocked in "D" state, they seem to wait for
> >>> some I/O completion. I can try to hand-copy some data if requested.
> >>>
> >>> I found these messages in dmesg:
> >>>
> >>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
> >>> EXT3-fs: mounted filesystem with ordered data mode.
> >>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
> >>> driverbyte=DRIVER_OK,SUGGEST_OK
> >>> end_request: I/O error, dev sda, sector 16460
> >>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
> >>> ReiserFS: sda7: using ordered data mode
> >>> --
> >>> ReiserFS: sda7: Using r5 hash to sort names
> >>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
> >>> driverbyte=DRIVER_OK,SUGGEST_OK
> >>> end_request: I/O error, dev sdb, sector 19632
> >>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
> >>> driverbyte=DRIVER_OK,SUGGEST_OK
> >>> end_request: I/O error, dev sdb, sector 40037363
> >>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 
> >>> extents:1 across:1048568k
> >>> lp0: using parport0 (interrupt-driven).
> >>>
> >>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% 
> >>> reproducible.
> >>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
> >>>
> >>> Maybe something is broken in pata_via driver ?
> >>>
> >> Could be - 
> >> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
> >> and 
> >> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
> >> touch pata_via.c.
> > None of the above...
> >
> > I did a bisection, it spotted git-scsi-misc.patch. 
> > I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works 
> > fine.
> >
> > I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not 
> > requeue requests if REQ_FAILFAST is set" is the real culprit. The other 
> > commits are touching documentation or drivers I don't use. I'll try 
> > to revert only this one this evening.
> >> I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 
> >> does fix the problem.
> >>
>  Hmm. Weird. I'll have a look into it. Apparently I'll be returning an 
>  error where
>  I shouldn't. Checking ...
> 
> >>> Ok, found it. We are blocking even special commands (ie requests with 
> >>> PREEMPT not set)
> >>> when FAILFAST is set. Which is clearly wrong. The attached patch fixes 
> >>> this.
> >> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O 
> >> errors.
> > 
> > I think the problem is the way we treat BLOCKED and QUIESCED (the latter
> > is the state that the domain validation uses and which we cannot kill
> > fastfail on).  It's definitely wrong to kill fastfail requests when the
> > state is QUIESCE.
> > 
> > This patch (which is applied on top of Hannes original) separates the
> > BLOCK and QUIESCE states correctly ... does this fix the problem?
> 
> 
> No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)

OK, could you post dmesgs again, please.  I actually tested this with an
aic79xx card, and for me it does cause Domain Validation to succeed
again.

James




Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Laurent Riffard


Le 24.11.2007 07:42, James Bottomley a écrit :
> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
>> Le 23.11.2007 12:38, Hannes Reinecke a écrit :
>>> Hannes Reinecke wrote:
 Laurent Riffard wrote:
> Le 21.11.2007 23:41, Andrew Morton a écrit :
>> On Wed, 21 Nov 2007 22:45:22 +0100
>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
>>
>>> Le 21.11.2007 05:45, Andrew Morton a écrit :
 ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>>> Hello, 
>>>
>>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
>>> that a bunch of task are blocked in "D" state, they seem to wait for
>>> some I/O completion. I can try to hand-copy some data if requested.
>>>
>>> I found these messages in dmesg:
>>>
>>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
>>> EXT3-fs: mounted filesystem with ordered data mode.
>>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sda, sector 16460
>>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
>>> ReiserFS: sda7: using ordered data mode
>>> --
>>> ReiserFS: sda7: Using r5 hash to sort names
>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sdb, sector 19632
>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sdb, sector 40037363
>>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 
>>> extents:1 across:1048568k
>>> lp0: using parport0 (interrupt-driven).
>>>
>>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% 
>>> reproducible.
>>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>>>
>>> Maybe something is broken in pata_via driver ?
>>>
>> Could be - 
>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
>> and 
>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
>> touch pata_via.c.
> None of the above...
>
> I did a bisection, it spotted git-scsi-misc.patch. 
> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine.
>
> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not 
> requeue requests if REQ_FAILFAST is set" is the real culprit. The other 
> commits are touching documentation or drivers I don't use. I'll try 
> to revert only this one this evening.
>> I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 
>> does fix the problem.
>>
 Hmm. Weird. I'll have a look into it. Apparently I'll be returning an 
 error where
 I shouldn't. Checking ...

>>> Ok, found it. We are blocking even special commands (ie requests with 
>>> PREEMPT not set)
>>> when FAILFAST is set. Which is clearly wrong. The attached patch fixes this.
>> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O 
>> errors.
> 
> I think the problem is the way we treat BLOCKED and QUIESCED (the latter
> is the state that the domain validation uses and which we cannot kill
> fastfail on).  It's definitely wrong to kill fastfail requests when the
> state is QUIESCE.
> 
> This patch (which is applied on top of Hannes original) separates the
> BLOCK and QUIESCE states correctly ... does this fix the problem?


No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)


> James
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 13e7e09..a7cf23a 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1279,18 +1279,21 @@ int scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
>   "rejecting I/O to dead device\n");
>   ret = BLKPREP_KILL;
>   break;
> - case SDEV_QUIESCE:
>   case SDEV_BLOCK:
>   /*
> -  * If the devices is blocked we defer normal commands.
> -  */
> - if (!(req->cmd_flags & REQ_PREEMPT))
> - ret = BLKPREP_DEFER;
> - /*
>* Return failfast requests immediately
>*/
>   if (req->cmd_flags & REQ_FAILFAST)
>   ret = BLKPREP_KILL;
> +
> + /* fall through */
> +
> + case SDEV_QUIESCE:
> + /*
> +  * If the devices is blocked we defer normal commands.
> +  */
> + if (!(req->cmd_flags & REQ_PREEMPT))
> + ret = BLKPREP_DEFER;
>   break;
>   default:
>   /*
> 
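
To make the effect of the quoted hunk easier to follow, here is a
stand-alone transcription of the two case labels as they read after the
patch.  It is a sketch, not the kernel function: the enums and flag
values are local stand-ins, prep_state_check() is a hypothetical name,
and the initial value of BLKPREP_OK is an assumption about the
surrounding code that the hunk does not show.

```c
/* Local stand-ins for the kernel's device states and prep results. */
enum sdev_state  { SDEV_RUNNING, SDEV_QUIESCE, SDEV_BLOCK };
enum prep_result { BLKPREP_OK, BLKPREP_KILL, BLKPREP_DEFER };

/* Illustrative flag bits only; the kernel's real values differ. */
#define REQ_FAILFAST (1u << 0)
#define REQ_PREEMPT  (1u << 1)

static enum prep_result prep_state_check(enum sdev_state state,
					 unsigned int cmd_flags)
{
	enum prep_result ret = BLKPREP_OK; /* assumed starting value */

	switch (state) {
	case SDEV_BLOCK:
		/* Return failfast requests immediately. */
		if (cmd_flags & REQ_FAILFAST)
			ret = BLKPREP_KILL;
		/* fall through */
	case SDEV_QUIESCE:
		/*
		 * Defer normal (non-preempt) commands.  Read literally,
		 * this assignment runs after the one above, so it wins
		 * for a blocked failfast request without REQ_PREEMPT.
		 */
		if (!(cmd_flags & REQ_PREEMPT))
			ret = BLKPREP_DEFER;
		break;
	default:
		break;
	}
	return ret;
}
```

In this transcription a request with both REQ_FAILFAST and REQ_PREEMPT
passes through in the QUIESCE state and is killed only in the BLOCK
state, which is the separation the patch description asks for.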

Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Gabriel C
James Bottomley wrote:
 On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
 Le 24.11.2007 07:42, James Bottomley a écrit :
 On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
 Le 23.11.2007 12:38, Hannes Reinecke a écrit :
 Hannes Reinecke wrote:
 Laurent Riffard wrote:
 Le 21.11.2007 23:41, Andrew Morton a écrit :
 On Wed, 21 Nov 2007 22:45:22 +0100
 Laurent Riffard [EMAIL PROTECTED] wrote:

 Le 21.11.2007 05:45, Andrew Morton a écrit :
 ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
 Hello, 

 My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
 that a bunch of task are blocked in D state, they seem to wait for
 some I/O completion. I can try to hand-copy some data if requested.

 I found these messages in dmesg:

 ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
 EXT3-fs: mounted filesystem with ordered data mode.
 sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
 driverbyte=DRIVER_OK,SUGGEST_OK
 end_request: I/O error, dev sda, sector 16460
 ReiserFS: sda7: found reiserfs format 3.6 with standard journal
 ReiserFS: sda7: using ordered data mode
 --
 ReiserFS: sda7: Using r5 hash to sort names
 sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
 driverbyte=DRIVER_OK,SUGGEST_OK
 end_request: I/O error, dev sdb, sector 19632
 sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
 driverbyte=DRIVER_OK,SUGGEST_OK
 end_request: I/O error, dev sdb, sector 40037363
 Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 
 extents:1 across:1048568k
 lp0: using parport0 (interrupt-driven).

 These errors occur *only* with 2.6.24-rc3-mm1, they are 100% 
 reproducible.
 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.

 Maybe something is broken in pata_via driver ?

 Could be - 
 libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
 and 
 pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
 touch pata_via.c.
 None of the above...

 I did a bisection, it spotted git-scsi-misc.patch. 
 I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works 
 fine.

 I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 [SCSI] Do not 
 requeue requests if REQ_FAILFAST is set is the real culprit. The other 
 commits are touching documentation or drivers I don't use. I'll try 
 to revert only this one this evening.
 I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 
 does fix the problem.

 Hmm. Weird. I'll have a look into it. Apparently I'll be returning an 
 error where
 I shouldn't. Checking ...

 Ok, found it. We are blocking even special commands (ie requests with 
 PREEMPT not set)
 when FAILFAST is set. Which is clearly wrong. The attached patch fixes 
 this.
 Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O 
 errors.
 I think the problem is the way we treat BLOCKED and QUIESCED (the latter
 is the state that the domain validation uses and which we cannot kill
 fastfail on).  It's definitely wrong to kill fastfail requests when the
 state is QUIESCE.

 This patch (which is applied on top of Hannes original) separates the
 BLOCK and QUIESCE states correctly ... does this fix the problem?

 No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
 
 OK, could you post dmesgs again, please.  I actually tested this with an
 aic79xx card, and for me it does cause Domain Validation to succeed
 again.
 

Are the patches indeed to fix that problem as well ? 

http://lkml.org/lkml/2007/11/23/5

 James

Gabriel 



Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread James Bottomley

On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote:
> Are the patches indeed to fix that problem as well ?
>
> http://lkml.org/lkml/2007/11/23/5

That dmesg is from an unknown SCSI card exhibiting Domain Validation
problems, so it's a reasonable probability, yes ... but you'll need the
additional hack I just did to prevent further intermittent failures.

James




Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Gabriel C
James Bottomley wrote:
> On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote:
>> Are the patches indeed to fix that problem as well ?
>>
>> http://lkml.org/lkml/2007/11/23/5
>
> That dmesg is from an unknown SCSI card exhibiting Domain Validation
> problems, so it's a reasonable probability, yes ... but you'll need the
> additional hack I just did to prevent further intermittent failures.

My controller is:

03:0e.0 SCSI storage controller [0100]: Adaptec AIC-7892P U160/m [9005:008f] (rev 02)

I'll try the patches in a bit.

 
 James
 

Gabriel


Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Gabriel C
Gabriel C wrote:
> James Bottomley wrote:
>> On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote:
>>> Are the patches indeed to fix that problem as well ?
>>>
>>> http://lkml.org/lkml/2007/11/23/5
>> That dmesg is from an unknown SCSI card exhibiting Domain Validation
>> problems, so it's a reasonable probability, yes ... but you'll need the
>> additional hack I just did to prevent further intermittent failures.
>
> My controller is:
>
> 03:0e.0 SCSI storage controller [0100]: Adaptec AIC-7892P U160/m [9005:008f] (rev 02)
>
> I'll try the patches in a bit.

With your patches my problem(s) are solved. Domain Validation works again.

...

[   32.179521] scsi 0:0:0:0: Direct-Access SEAGATE  ST318406LW   0109 PQ: 0 ANSI: 3
[   32.179540] scsi0:A:0:0: Tagged Queuing enabled.  Depth 32
[   32.179554]  target0:0:0: Beginning Domain Validation
[   32.188553]  target0:0:0: wide asynchronous
[   32.195302]  target0:0:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 63)
[   32.206510]  target0:0:0: Ending Domain Validation
[   32.211699] scsi 0:0:1:0: Direct-Access FUJITSU  MAH3182MP0114 PQ: 0 ANSI: 4
[   32.211707] scsi0:A:1:0: Tagged Queuing enabled.  Depth 32
[   32.211717]  target0:0:1: Beginning Domain Validation
[   32.213980]  target0:0:1: wide asynchronous
[   32.215682]  target0:0:1: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 127)
[   32.220205]  target0:0:1: Ending Domain Validation

...

 Thx James :)


Gabriel


Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Laurent Riffard
Le 24.11.2007 14:26, James Bottomley a écrit :
 On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
 Le 24.11.2007 07:42, James Bottomley a écrit :
 On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
 Le 23.11.2007 12:38, Hannes Reinecke a écrit :
[snip]
 I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 
 does fix the problem.

 Hmm. Weird. I'll have a look into it. Apparently I'll be returning an 
 error where
 I shouldn't. Checking ...

 Ok, found it. We are blocking even special commands (ie requests with 
 PREEMPT not set)
 when FAILFAST is set. Which is clearly wrong. The attached patch fixes 
 this.
 Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O 
 errors.
 I think the problem is the way we treat BLOCKED and QUIESCED (the latter
 is the state that the domain validation uses and which we cannot kill
 fastfail on).  It's definitely wrong to kill fastfail requests when the
 state is QUIESCE.

 This patch (which is applied on top of Hannes original) separates the
 BLOCK and QUIESCE states correctly ... does this fix the problem?

 No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
 
 OK, could you post dmesgs again, please.  I actually tested this with an
 aic79xx card, and for me it does cause Domain Validation to succeed
 again.

James, 

Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates the 
BLOCK and QUIESCE states correctly" (http://lkml.org/lkml/2007/11/24/8).

How to reproduce:
- boot
- switch to a text console
- capture dmesg in a file, sync, etc. There are 3 I/O errors, but the 
  system does work.
- switch to the X console and log in to the GNOME desktop: the system 
  partially hangs.
- switch back to a text console: dmesg(1) still works and shows some 
  additional I/O errors. At this point, any disk access makes the system 
  hang completely.

Additional data:
- the I/O errors always happen on the same blocks.

-- 
laurent
[0.00] Linux version 2.6.24-rc3-mm1 ([EMAIL PROTECTED]) (gcc version 
4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)) #122 PREEMPT Fri Nov 23 
18:47:58 CET 2007
[0.00] BIOS-provided physical RAM map:
[0.00]  BIOS-e820:  - 0009fc00 (usable)
[0.00]  BIOS-e820: 0009fc00 - 000a (reserved)
[0.00]  BIOS-e820: 000f - 0010 (reserved)
[0.00]  BIOS-e820: 0010 - 1ffec000 (usable)
[0.00]  BIOS-e820: 1ffec000 - 1ffef000 (ACPI data)
[0.00]  BIOS-e820: 1ffef000 - 1000 (reserved)
[0.00]  BIOS-e820: 1000 - 2000 (ACPI NVS)
[0.00]  BIOS-e820:  - 0001 (reserved)
[0.00] 511MB LOWMEM available.
[0.00] Entering add_active_range(0, 0, 131052) 0 entries of 256 used
[0.00] sizeof(struct page) = 32
[0.00] Zone PFN ranges:
[0.00]   DMA 0 - 4096
[0.00]   Normal   4096 -   131052
[0.00] Movable zone start PFN for each node
[0.00] early_node_map[1] active PFN ranges
[0.00] 0:0 -   131052
[0.00] On node 0 totalpages: 131052
[0.00] Node 0 memmap at 0xC100 size 4194304 first pfn 0xC100
[0.00]   DMA zone: 32 pages used for memmap
[0.00]   DMA zone: 0 pages reserved
[0.00]   DMA zone: 4064 pages, LIFO batch:0
[0.00]   Normal zone: 991 pages used for memmap
[0.00]   Normal zone: 125965 pages, LIFO batch:31
[0.00]   Movable zone: 0 pages used for memmap
[0.00] DMI 2.3 present.
[0.00] ACPI: RSDP 000F6A80, 0014 (r0 ASUS  )
[0.00] ACPI: RSDT 1FFEC000, 002C (r1 ASUS   A7V133-C 30303031 MSFT 
31313031)
[0.00] ACPI: FACP 1FFEC080, 0074 (r1 ASUS   A7V133-C 30303031 MSFT 
31313031)
[0.00] ACPI: DSDT 1FFEC100, 2CE1 (r1   ASUS A7V133-C 1000 MSFT  
10B)
[0.00] ACPI: FACS 1000, 0040
[0.00] ACPI: BOOT 1FFEC040, 0028 (r1 ASUS   A7V133-C 30303031 MSFT 
31313031)
[0.00] ACPI: PM-Timer IO Port: 0xe408
[0.00] Allocating PCI resources starting at 3000 (gap: 
2000:dfff)
[0.00] swsusp: Registered nosave memory region: 0009f000 - 
000a
[0.00] swsusp: Registered nosave memory region: 000a - 
000f
[0.00] swsusp: Registered nosave memory region: 000f - 
0010
[0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total 
pages: 130029
[0.00] Kernel command line: root=/dev/mapper/vglinux1-lv_ubuntu2 ro 
locale=fr_FR video=radeonfb:[EMAIL PROTECTED] resume=/dev/mapper/vglinux1-lvswap
[0.00] Local APIC disabled by BIOS -- you can enable it with lapic
[0.00] mapped APIC to b000 (01406000)
[0.00] Enabling fast FPU save and restore... done.
[0.00] Enabling unmasked SIMD FPU 

Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread James Bottomley
On Sat, 2007-11-24 at 23:59 +0100, Laurent Riffard wrote:
 Le 24.11.2007 14:26, James Bottomley a écrit :
  OK, could you post dmesgs again, please.  I actually tested this
 with an
  aic79xx card, and for me it does cause Domain Validation to succeed
  again.
 
 James, 
 
 Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates the
 BLOCK and QUIESCE states correctly" (http://lkml.org/lkml/2007/11/24/8).
 
 How to reproduce:
 - boot
 - switch to a text console
 - capture dmesg in a file, sync, etc. There are 3 I/O errors, but the 
   system does work.
 - switch to the X console and log in to the GNOME desktop: the system 
   partially hangs.
 - switch back to a text console: dmesg(1) still works and shows some 
   additional I/O errors. At this point, any disk access makes the system 
   hang completely.
 
 Additional data:
 - the I/O errors always happen on the same blocks.
 
 plain text document attachment (dmesg-2.6.24-rc3-mm1-patched)
[...]
 [   25.521256] scsi0 : pata_via
 [   25.521711] scsi1 : pata_via
 [   25.524089] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma
 0xb800 irq 14
 [   25.524176] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma
 0xb808 irq 15
 [   25.683141] ata1.00: ATA-5: ST340016A, 3.75, max UDMA/100
 [   25.683208] ata1.00: 78165360 sectors, multi 16: LBA 
 [   25.683475] ata1.01: ATA-7: Maxtor 6Y080L0, YAR41BW0, max UDMA/133
 [   25.684116] ata1.01: 160086528 sectors, multi 16: LBA 
 [   25.691127] ata1.00: configured for UDMA/100
 [   25.699142] ata1.01: configured for UDMA/100
 [   26.170860] ata2.00: ATAPI: HL-DT-ST DVDRAM GSA-4165B, DL05, max
 UDMA/33
 [   26.171562] ata2.01: ATAPI: CD-950E/AKU, A4Q, max MWDMA2, CDB intr
 [   26.330839] ata2.00: configured for UDMA/33
 [   26.490828] ata2.01: configured for MWDMA2
 [   26.503014] scsi 0:0:0:0: Direct-Access ATA  ST340016A
 3.75 PQ: 0 ANSI: 5
 [   26.504670] scsi 0:0:1:0: Direct-Access ATA  Maxtor 6Y080L0
 YAR4 PQ: 0 ANSI: 5
 [   26.509842] scsi 1:0:0:0: CD-ROMHL-DT-ST DVDRAM
 GSA-4165B DL05 PQ: 0 ANSI: 5
 [   26.511673] scsi 1:0:1:0: CD-ROME-IDECD-950E/AKU
 A4Q  PQ: 0 ANSI: 5
[...]
 [   60.216113] sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT
 driverbyte=DRIVER_OK,SUGGEST_OK
 [   60.216124] end_request: I/O error, dev sda, sector 16460

I think this one's quite easy:  PATA devices in libata are queue depth 1
(since they don't do NCQ).  Thus, they're peculiarly sensitive to the
bug where we fail over queue depth requests.

On the other hand, I don't see how a filesystem request is getting
REQ_FAILFAST ... unless there's a bio or readahead issue involved.
Anyway, could you try this patch:

http://marc.info/?l=linux-scsi&m=119592627425498

Which should fix the queue depth issue, and see if the errors go away?

Thanks,

James
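
To make the queue depth point above concrete, here is a toy model; it is not libata or SCSI midlayer code. The struct toy_dev, its fields and the helper names are hypothetical. The sketch assumes a device counts as ready only while fewer commands are in flight than its queue depth, and that a FAILFAST request arriving while the device is not ready is failed instead of waiting.

```c
#include <stdbool.h>

/* Hypothetical device model: commands in flight versus allowed depth. */
struct toy_dev {
	unsigned int queue_depth;
	unsigned int in_flight;
};

/* Assumed readiness rule: there is room for one more command. */
static bool toy_queue_ready(const struct toy_dev *dev)
{
	return dev->in_flight < dev->queue_depth;
}

/*
 * Offer n FAILFAST requests back to back, with no completions in between,
 * and return how many would be failed under the behaviour described above.
 */
static unsigned int toy_failed_requests(struct toy_dev dev, unsigned int n)
{
	unsigned int failed = 0;

	for (unsigned int i = 0; i < n; i++) {
		if (toy_queue_ready(&dev))
			dev.in_flight++;	/* dispatched to the device */
		else
			failed++;		/* killed instead of requeued */
	}
	return failed;
}
```

In this model a depth-1 device fails every request after the first one in a burst, while a device with a deep queue fails none, which is one way to read why a depth-1 PATA device would be hit hardest.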




Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-23 Thread James Bottomley

On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
> Le 23.11.2007 12:38, Hannes Reinecke a écrit :
> > Hannes Reinecke wrote:
> >> Laurent Riffard wrote:
> >>> Le 21.11.2007 23:41, Andrew Morton a écrit :
>  On Wed, 21 Nov 2007 22:45:22 +0100
>  Laurent Riffard <[EMAIL PROTECTED]> wrote:
> 
> > Le 21.11.2007 05:45, Andrew Morton a écrit :
> >> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> > Hello, 
> >
> > My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
> > that a bunch of task are blocked in "D" state, they seem to wait for
> > some I/O completion. I can try to hand-copy some data if requested.
> >
> > I found these messages in dmesg:
> >
> > ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
> > EXT3-fs: mounted filesystem with ordered data mode.
> > sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
> > driverbyte=DRIVER_OK,SUGGEST_OK
> > end_request: I/O error, dev sda, sector 16460
> > ReiserFS: sda7: found reiserfs format "3.6" with standard journal
> > ReiserFS: sda7: using ordered data mode
> > --
> > ReiserFS: sda7: Using r5 hash to sort names
> > sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
> > driverbyte=DRIVER_OK,SUGGEST_OK
> > end_request: I/O error, dev sdb, sector 19632
> > sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
> > driverbyte=DRIVER_OK,SUGGEST_OK
> > end_request: I/O error, dev sdb, sector 40037363
> > Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 
> > extents:1 across:1048568k
> > lp0: using parport0 (interrupt-driven).
> >
> > These errors occur *only* with 2.6.24-rc3-mm1, they are 100% 
> > reproducible.
> > 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
> >
> > Maybe something is broken in pata_via driver ?
> >
>  Could be - 
>  libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
>  and 
>  pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
>  touch pata_via.c.
> >>> None of the above...
> >>>
> >>> I did a bisection, it spotted git-scsi-misc.patch. 
> >>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine.
> >>>
> >>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not 
> >>> requeue requests if REQ_FAILFAST is set" is the real culprit. The other 
> >>> commits are touching documentation or drivers I don't use. I'll try 
> >>> to revert only this one this evening.
> 
> I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 
> does fix the problem.
> 
> >> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an 
> >> error where
> >> I shouldn't. Checking ...
> >>
> > Ok, found it. We are blocking even special commands (ie requests with 
> > PREEMPT not set)
> > when FAILFAST is set. Which is clearly wrong. The attached patch fixes this.
> 
> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O 
> errors.

I think the problem is the way we treat BLOCKED and QUIESCED (the latter
is the state that the domain validation uses and which we cannot kill
fastfail on).  It's definitely wrong to kill fastfail requests when the
state is QUIESCE.

This patch (which is applied on top of Hannes original) separates the
BLOCK and QUIESCE states correctly ... does this fix the problem?

James

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 13e7e09..a7cf23a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1279,18 +1279,21 @@ int scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 				    "rejecting I/O to dead device\n");
 			ret = BLKPREP_KILL;
 			break;
-		case SDEV_QUIESCE:
 		case SDEV_BLOCK:
 			/*
-			 * If the devices is blocked we defer normal commands.
-			 */
-			if (!(req->cmd_flags & REQ_PREEMPT))
-				ret = BLKPREP_DEFER;
-			/*
 			 * Return failfast requests immediately
 			 */
 			if (req->cmd_flags & REQ_FAILFAST)
 				ret = BLKPREP_KILL;
+
+			/* fall through */
+
+		case SDEV_QUIESCE:
+			/*
+			 * If the devices is blocked we defer normal commands.
+			 */
+			if (!(req->cmd_flags & REQ_PREEMPT))
+				ret = BLKPREP_DEFER;
 			break;
 		default:
 			/*
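
The control flow of the hunk above can be modelled outside the kernel. The sketch below is only a reading of the posted diff: the toy_ names, the flag values and the initial "OK" result are assumptions for illustration, not kernel definitions.

```c
/* Assumed values, for illustration only; not the kernel's definitions. */
#define REQ_FAILFAST (1u << 0)
#define REQ_PREEMPT  (1u << 1)

enum toy_state { TOY_BLOCK, TOY_QUIESCE };
enum toy_prep  { TOY_PREP_OK, TOY_PREP_DEFER, TOY_PREP_KILL };

/* Hypothetical stand-in for the BLOCK/QUIESCE arms of the state check. */
static enum toy_prep toy_state_check(enum toy_state state,
				     unsigned int cmd_flags)
{
	enum toy_prep ret = TOY_PREP_OK;	/* assumed starting value */

	switch (state) {
	case TOY_BLOCK:
		/* Only the BLOCK state considers failing FAILFAST requests. */
		if (cmd_flags & REQ_FAILFAST)
			ret = TOY_PREP_KILL;
		/* fall through */
	case TOY_QUIESCE:
		/* Both states defer requests that are not PREEMPT. */
		if (!(cmd_flags & REQ_PREEMPT))
			ret = TOY_PREP_DEFER;
		break;
	}
	return ret;
}
```

In this model a QUIESCE device never returns the KILL result, which matches the stated goal of not killing failfast requests in that state. Read literally, the fall-through also lets the DEFER assignment overwrite KILL for a BLOCK request that is FAILFAST but not PREEMPT; the thread does not say whether that was intended.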



Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-23 Thread Laurent Riffard
Le 23.11.2007 12:38, Hannes Reinecke a écrit :
> Hannes Reinecke wrote:
>> Laurent Riffard wrote:
>>> Le 21.11.2007 23:41, Andrew Morton a écrit :
 On Wed, 21 Nov 2007 22:45:22 +0100
 Laurent Riffard <[EMAIL PROTECTED]> wrote:

> Le 21.11.2007 05:45, Andrew Morton a écrit :
>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> Hello, 
>
> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
> that a bunch of task are blocked in "D" state, they seem to wait for
> some I/O completion. I can try to hand-copy some data if requested.
>
> I found these messages in dmesg:
>
> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
> EXT3-fs: mounted filesystem with ordered data mode.
> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
> driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sda, sector 16460
> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
> ReiserFS: sda7: using ordered data mode
> --
> ReiserFS: sda7: Using r5 hash to sort names
> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
> driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sdb, sector 19632
> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
> driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sdb, sector 40037363
> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 
> extents:1 across:1048568k
> lp0: using parport0 (interrupt-driven).
>
> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible.
> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>
> Maybe something is broken in pata_via driver ?
>
 Could be - 
 libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
 and 
 pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
 touch pata_via.c.
>>> None of the above...
>>>
>>> I did a bisection, it spotted git-scsi-misc.patch. 
>>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine.
>>>
>>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not 
>>> requeue requests if REQ_FAILFAST is set" is the real culprit. The other 
>>> commits are touching documentation or drivers I don't use. I'll try 
>>> to revert only this one this evening.

I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 
does fix the problem.

>> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an error 
>> where
>> I shouldn't. Checking ...
>>
> Ok, found it. We are blocking even special commands (ie requests with PREEMPT 
> not set)
> when FAILFAST is set. Which is clearly wrong. The attached patch fixes this.

Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O errors.

-- 
laurent



Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-23 Thread Hannes Reinecke
Hannes Reinecke wrote:
> Laurent Riffard wrote:
>> On 21.11.2007 23:41, Andrew Morton wrote:
>>> On Wed, 21 Nov 2007 22:45:22 +0100
>>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
>>>
>>>> On 21.11.2007 05:45, Andrew Morton wrote:
>>>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>>>> Hello,
>>>>
>>>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
>>>> that a bunch of task are blocked in "D" state, they seem to wait for
>>>> some I/O completion. I can try to hand-copy some data if requested.
>>>>
>>>> I found these messages in dmesg:
>>>>
>>>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1
>>>> EXT3-fs: mounted filesystem with ordered data mode.
>>>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>>>> end_request: I/O error, dev sda, sector 16460
>>>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
>>>> ReiserFS: sda7: using ordered data mode
>>>> --
>>>> ReiserFS: sda7: Using r5 hash to sort names
>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>>>> end_request: I/O error, dev sdb, sector 19632
>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>>>> end_request: I/O error, dev sdb, sector 40037363
>>>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 extents:1 across:1048568k
>>>> lp0: using parport0 (interrupt-driven).
>>>>
>>>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible.
>>>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>>>>
>>>> Maybe something is broken in pata_via driver ?

>>> Could be - 
>>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
>>> and 
>>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
>>> touch pata_via.c.
>> None of the above...
>>
>> I did a bisection, it spotted git-scsi-misc.patch. 
>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine.
>>
>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not 
>> requeue requests if REQ_FAILFAST is set" is the real culprit. The other 
>> commits are touching documentation or drivers I don't use. I'll try 
>> to revert only this one this evening.
>>
> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an error
> where I shouldn't. Checking ...
> 
Ok, found it. We are blocking even special commands (i.e. requests with PREEMPT
set) when FAILFAST is set. Which is clearly wrong. The attached patch fixes this.

James, please apply.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
[EMAIL PROTECTED] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
Fix SPI Domain validation

This fixes a thinko of the FAILFAST handling: when we get
a request with FAILFAST set, we still have to evaluate the
PREEMPT flag to decide if this request should be passed through.

Signed-off-by: Hannes Reinecke <[EMAIL PROTECTED]>

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 13e7e09..9ec1566 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1284,13 +1284,15 @@ int scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 			/*
 			 * If the devices is blocked we defer normal commands.
 			 */
-			if (!(req->cmd_flags & REQ_PREEMPT))
-				ret = BLKPREP_DEFER;
-			/*
-			 * Return failfast requests immediately
-			 */
-			if (req->cmd_flags & REQ_FAILFAST)
-				ret = BLKPREP_KILL;
+			if (!(req->cmd_flags & REQ_PREEMPT)) {
+				/*
+				 * Return failfast requests immediately
+				 */
+				if (req->cmd_flags & REQ_FAILFAST)
+					ret = BLKPREP_KILL;
+				else
+					ret = BLKPREP_DEFER;
+			}
 			break;
 		default:
 			/*
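
For readers who want to trace the effect of the patch above outside the kernel, here is a minimal user-space sketch of the two versions of the blocked-device branch. It is only a model: the enum values, the flag macros and the function names are stand-ins invented for illustration, and the initial value PREP_OK is an assumption about how the surrounding function starts out.

```c
#include <assert.h>

/* Stand-ins for the kernel's BLKPREP_* results (hypothetical names). */
enum prep_result { PREP_OK, PREP_KILL, PREP_DEFER };

/* Stand-ins for REQ_FAILFAST and REQ_PREEMPT (values are arbitrary). */
#define FLAG_FAILFAST (1u << 0)
#define FLAG_PREEMPT  (1u << 1)

/* Branch as it looks on the '-' lines: FAILFAST is tested regardless of PREEMPT. */
static enum prep_result blocked_before(unsigned int flags)
{
    enum prep_result ret = PREP_OK;   /* assumed starting value */

    if (!(flags & FLAG_PREEMPT))
        ret = PREP_DEFER;
    if (flags & FLAG_FAILFAST)
        ret = PREP_KILL;
    return ret;
}

/* Branch as it looks on the '+' lines: only non-PREEMPT requests are killed or deferred. */
static enum prep_result blocked_after(unsigned int flags)
{
    enum prep_result ret = PREP_OK;   /* assumed starting value */

    if (!(flags & FLAG_PREEMPT)) {
        if (flags & FLAG_FAILFAST)
            ret = PREP_KILL;
        else
            ret = PREP_DEFER;
    }
    return ret;
}

/* Spot checks for a request that carries both PREEMPT and FAILFAST. */
static void check_examples(void)
{
    unsigned int special = FLAG_PREEMPT | FLAG_FAILFAST;

    assert(blocked_before(special) == PREP_KILL);       /* rejected before the patch */
    assert(blocked_after(special) == PREP_OK);          /* passed through after it */
    assert(blocked_after(FLAG_FAILFAST) == PREP_KILL);  /* normal failfast still killed */
    assert(blocked_after(0) == PREP_DEFER);             /* normal request still deferred */
}
```

In this model the only input whose result changes is a request with both flags set, which matches the description in the patch header.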


Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-23 Thread James Bottomley

On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
> On 23.11.2007 12:38, Hannes Reinecke wrote:
> > Hannes Reinecke wrote:
> >> Laurent Riffard wrote:
> >>> On 21.11.2007 23:41, Andrew Morton wrote:
> >>>> On Wed, 21 Nov 2007 22:45:22 +0100
> >>>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> On 21.11.2007 05:45, Andrew Morton wrote:
> >>>>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> >>>>> Hello,
> >>>>>
> >>>>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
> >>>>> that a bunch of task are blocked in "D" state, they seem to wait for
> >>>>> some I/O completion. I can try to hand-copy some data if requested.
> >>>>>
> >>>>> I found these messages in dmesg:
> >>>>>
> >>>>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1
> >>>>> EXT3-fs: mounted filesystem with ordered data mode.
> >>>>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> >>>>> end_request: I/O error, dev sda, sector 16460
> >>>>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
> >>>>> ReiserFS: sda7: using ordered data mode
> >>>>> --
> >>>>> ReiserFS: sda7: Using r5 hash to sort names
> >>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> >>>>> end_request: I/O error, dev sdb, sector 19632
> >>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> >>>>> end_request: I/O error, dev sdb, sector 40037363
> >>>>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 extents:1 across:1048568k
> >>>>> lp0: using parport0 (interrupt-driven).
> >>>>>
> >>>>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible.
> >>>>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
> >>>>>
> >>>>> Maybe something is broken in pata_via driver ?
> >>>>>
> >>>> Could be -
> >>>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
> >>>> and
> >>>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
> >>>> touch pata_via.c.
> >>> None of the above...
> >>>
> >>> I did a bisection, it spotted git-scsi-misc.patch.
> >>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine.
> >>>
> >>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not
> >>> requeue requests if REQ_FAILFAST is set" is the real culprit. The other
> >>> commits are touching documentation or drivers I don't use. I'll try
> >>> to revert only this one this evening.
> >>>
> I can confirm: reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
> does fix the problem.
>
> >> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an
> >> error where I shouldn't. Checking ...
> >>
> > Ok, found it. We are blocking even special commands (i.e. requests with
> > PREEMPT set) when FAILFAST is set. Which is clearly wrong. The attached
> > patch fixes this.
>
> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O
> errors.

I think the problem is the way we treat BLOCKED and QUIESCED (the latter
is the state that the domain validation uses and in which we cannot kill
failfast requests).  It's definitely wrong to kill failfast requests when the
state is QUIESCE.

This patch (which is applied on top of Hannes' original) separates the
BLOCK and QUIESCE states correctly ... does this fix the problem?

James

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 13e7e09..a7cf23a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1279,18 +1279,21 @@ int scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 				    "rejecting I/O to dead device\n");
 			ret = BLKPREP_KILL;
 			break;
-		case SDEV_QUIESCE:
 		case SDEV_BLOCK:
 			/*
-			 * If the devices is blocked we defer normal commands.
-			 */
-			if (!(req->cmd_flags & REQ_PREEMPT))
-				ret = BLKPREP_DEFER;
-			/*
 			 * Return failfast requests immediately
 			 */
 			if (req->cmd_flags & REQ_FAILFAST)
 				ret = BLKPREP_KILL;
+
+			/* fall through */
+
+		case SDEV_QUIESCE:
+			/*
+			 * If the devices is blocked we defer normal commands.
+			 */
+			if (!(req->cmd_flags & REQ_PREEMPT))
+				ret = BLKPREP_DEFER;
 			break;
 		default:
 			/*
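
The control flow of the hunk above can be traced with a small user-space model. This is a sketch under stated assumptions, not kernel code: the enums, the flag macros and state_check() are hypothetical stand-ins, only the BLOCK and QUIESCE cases are modelled, and the starting value PREP_OK is assumed.

```c
#include <assert.h>

/* Stand-ins for the kernel's BLKPREP_* results (hypothetical names). */
enum prep_result { PREP_OK, PREP_KILL, PREP_DEFER };

/* Stand-ins for SDEV_BLOCK and SDEV_QUIESCE (hypothetical names). */
enum dev_state { STATE_BLOCK, STATE_QUIESCE };

/* Stand-ins for REQ_FAILFAST and REQ_PREEMPT (values are arbitrary). */
#define FLAG_FAILFAST (1u << 0)
#define FLAG_PREEMPT  (1u << 1)

/* Mirrors the '+' side of the hunk: BLOCK tests FAILFAST, then falls into QUIESCE. */
static enum prep_result state_check(enum dev_state state, unsigned int flags)
{
    enum prep_result ret = PREP_OK;   /* assumed starting value */

    switch (state) {
    case STATE_BLOCK:
        if (flags & FLAG_FAILFAST)
            ret = PREP_KILL;
        /* fall through */
    case STATE_QUIESCE:
        if (!(flags & FLAG_PREEMPT))
            ret = PREP_DEFER;
        break;
    }
    return ret;
}

/* Spot checks of what the model returns. */
static void check_examples(void)
{
    /* A quiesced device lets PREEMPT requests through, FAILFAST or not. */
    assert(state_check(STATE_QUIESCE, FLAG_PREEMPT | FLAG_FAILFAST) == PREP_OK);
    /* A quiesced device defers, rather than kills, a plain FAILFAST request. */
    assert(state_check(STATE_QUIESCE, FLAG_FAILFAST) == PREP_DEFER);
    /* In this model the fall-through also turns KILL into DEFER here. */
    assert(state_check(STATE_BLOCK, FLAG_FAILFAST) == PREP_DEFER);
}
```

In this model a request is never killed in the QUIESCE state, which is the stated goal of the patch. The last check is a property of the sketch as written from the posted hunk: in the BLOCK state the fall-through overwrites the result for requests without PREEMPT.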




Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-22 Thread Hannes Reinecke
Laurent Riffard wrote:
> On 21.11.2007 23:41, Andrew Morton wrote:
>> On Wed, 21 Nov 2007 22:45:22 +0100
>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
>>
>>> On 21.11.2007 05:45, Andrew Morton wrote:
>>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>>> Hello, 
>>>
>>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
>>> that a bunch of task are blocked in "D" state, they seem to wait for
>>> some I/O completion. I can try to hand-copy some data if requested.
>>>
>>> I found these messages in dmesg:
>>>
>>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
>>> EXT3-fs: mounted filesystem with ordered data mode.
>>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sda, sector 16460
>>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
>>> ReiserFS: sda7: using ordered data mode
>>> --
>>> ReiserFS: sda7: Using r5 hash to sort names
>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sdb, sector 19632
>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sdb, sector 40037363
>>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 extents:1 
>>> across:1048568k
>>> lp0: using parport0 (interrupt-driven).
>>>
>>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible.
>>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>>>
>>> Maybe something is broken in pata_via driver ?
>>>
>> Could be - 
>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
>> and 
>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
>> touch pata_via.c.
> 
> None of the above...
> 
> I did a bisection, it spotted git-scsi-misc.patch. 
> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine.
> 
> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not 
> requeue requests if REQ_FAILFAST is set" is the real culprit. The other 
> commits are touching documentation or drivers I don't use. I'll try 
> to revert only this one this evening.
> 
Hmm. Weird. I'll have a look into it. Apparently I'll be returning an error
where I shouldn't. Checking ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
[EMAIL PROTECTED] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-22 Thread Laurent Riffard

On 21.11.2007 23:41, Andrew Morton wrote:
> On Wed, 21 Nov 2007 22:45:22 +0100
> Laurent Riffard <[EMAIL PROTECTED]> wrote:
> 
>> On 21.11.2007 05:45, Andrew Morton wrote:
>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>> Hello, 
>>
>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
>> that a bunch of task are blocked in "D" state, they seem to wait for
>> some I/O completion. I can try to hand-copy some data if requested.
>>
>> I found these messages in dmesg:
>>
>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
>> EXT3-fs: mounted filesystem with ordered data mode.
>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
>> driverbyte=DRIVER_OK,SUGGEST_OK
>> end_request: I/O error, dev sda, sector 16460
>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
>> ReiserFS: sda7: using ordered data mode
>> --
>> ReiserFS: sda7: Using r5 hash to sort names
>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
>> driverbyte=DRIVER_OK,SUGGEST_OK
>> end_request: I/O error, dev sdb, sector 19632
>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
>> driverbyte=DRIVER_OK,SUGGEST_OK
>> end_request: I/O error, dev sdb, sector 40037363
>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 extents:1 
>> across:1048568k
>> lp0: using parport0 (interrupt-driven).
>>
>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible.
>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>>
>> Maybe something is broken in pata_via driver ?
>>
> 
> Could be - 
> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
> and 
> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
> touch pata_via.c.

None of the above...

I did a bisection, it spotted git-scsi-misc.patch. 
I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine.

I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not 
requeue requests if REQ_FAILFAST is set" is the real culprit. The other 
commits are touching documentation or drivers I don't use. I'll try 
to revert only this one this evening.

-- 
laurent




Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-21 Thread Andrew Morton
On Wed, 21 Nov 2007 22:45:22 +0100
Laurent Riffard <[EMAIL PROTECTED]> wrote:

> On 21.11.2007 05:45, Andrew Morton wrote:
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> 
> Hello, 
> 
> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
> that a bunch of task are blocked in "D" state, they seem to wait for
> some I/O completion. I can try to hand-copy some data if requested.
> 
> I found these messages in dmesg:
> 
> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
> EXT3-fs: mounted filesystem with ordered data mode.
> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
> driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sda, sector 16460
> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
> ReiserFS: sda7: using ordered data mode
> --
> ReiserFS: sda7: Using r5 hash to sort names
> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
> driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sdb, sector 19632
> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
> driverbyte=DRIVER_OK,SUGGEST_OK
> end_request: I/O error, dev sdb, sector 40037363
> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 extents:1 
> across:1048568k
> lp0: using parport0 (interrupt-driven).
> 
> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% reproducible.
> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
> 
> Maybe something is broken in pata_via driver ?
> 

Could be - 
libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
and pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
touch pata_via.c.

