subject:"sata_sil24 broken since 2.6.23\-rc4\-mm1"

On Thu, Oct 11 2007, Tejun Heo wrote:
> Jens Axboe wrote:
> > This is the old ata_sg_is_last:
> > 
> > static inline int
> > ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
> > {
> > if (sg == >pad_sgent)
> > return 1;
> > if (qc->pad_len)
> > return 0;
> > if (((sg - qc->__sg) + 1) == qc->n_elem)
> > return 1;
> > return 0;
> > }
> > 
> > and the new one:
> > 
> > static inline int
> > ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
> > {
> > if (sg == >pad_sgent)
> > return 1;
> > if (qc->pad_len)
> > return 0;
> > if (qc->n_iter == qc->n_elem)
> > return 1; 
> > return 0;
> > }
> > 
> > ->n_iter is how ata_qc_next_sg() walks over the sglist, I don't
> > understand your reference to why depending on that during iteration
> > would be bad?
> 
> Because that makes ata_sg iterator macros have hidden side effects
> (nothing in the interface suggests it can't be nested and when somebody
> actually nests it, finding what went wrong can be pretty difficult).  I
> think it would be better to have explicit ata_sg_iter passed around if
> sg itself isn't enough to walk the sg list.

I think it's implicit in the interface, since you don't pass an sgtable
in. But it's not a big deal to me, if you want it changed, go ahead :-)

> > So we could add a test for sg_last() there, but that would turn sg table
> > iteration into an O(N^2) operation for those drivers that use
> > ata_sg_is_last() with chained sg tables. I'd much rather just get rid of
> > ata_sg_is_last(), it's only used to mark end-of-table entries for
> > hardware. That logic can be performed cheaper.
> 
> Yeap, it can be removed but having "is this the last one?" test is just
> nicer to low level drivers.  With ata_sg_iter, I think we can do it in
> O(N) by looking up and caching the next entry.

Sure, of you could just lookup sg_last() in the beginning of the loop.
Still seems a bit silly to me just to keep the ata_sg_is_last()
interface, since it can be done for free basically by maintaining an
sglast element in the loop that you then use when the loop is done to
set your SG_END marker (or whatever the driver needs).

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

Jens Axboe wrote:
> This is the old ata_sg_is_last:
> 
> static inline int
> ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
> {
> if (sg == >pad_sgent)
> return 1;
> if (qc->pad_len)
> return 0;
> if (((sg - qc->__sg) + 1) == qc->n_elem)
> return 1;
> return 0;
> }
> 
> and the new one:
> 
> static inline int
> ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
> {
> if (sg == >pad_sgent)
> return 1;
> if (qc->pad_len)
> return 0;
> if (qc->n_iter == qc->n_elem)
> return 1; 
> return 0;
> }
> 
> ->n_iter is how ata_qc_next_sg() walks over the sglist, I don't
> understand your reference to why depending on that during iteration
> would be bad?

Because that makes ata_sg iterator macros have hidden side effects
(nothing in the interface suggests it can't be nested and when somebody
actually nests it, finding what went wrong can be pretty difficult).  I
think it would be better to have explicit ata_sg_iter passed around if
sg itself isn't enough to walk the sg list.

> So we could add a test for sg_last() there, but that would turn sg table
> iteration into an O(N^2) operation for those drivers that use
> ata_sg_is_last() with chained sg tables. I'd much rather just get rid of
> ata_sg_is_last(), it's only used to mark end-of-table entries for
> hardware. That logic can be performed cheaper.

Yeap, it can be removed but having "is this the last one?" test is just
nicer to low level drivers.  With ata_sg_iter, I think we can do it in
O(N) by looking up and caching the next entry.

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On Thu, Oct 11 2007, Tejun Heo wrote:
> Torsten Kaiser wrote:
> > Looking closer at
> > http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
> > the change to libata.h seems bogus :
> > 
> > in ata_qc_first_sg:
> > oldnew
> > return qc->__sgreturn qc->__sg
> > qc->__sg - qc->__sg == 0   qc->n_iter=0
> > -> sg - qc->__sg corresponds to qc->n_iter
> > 
> > in ata_qc_next_sg:
> > sg++;  sg_next(sg); qc->n_iter++;
> > sg - qc->__sg < qc->n_elem qc->n_iter < qc->nelem
> > -> sg - qc->__sg corresponds to qc->n_iter
> > 
> > but in ata_sg_is_last:
> > (sg - qc->__sg) +1 == qc->n_elem   qc->n_iter == qc->n_elem
> > if sg - qc->__sg corresponds to qc->n_iter then shoudn't it be
> > qc->n_iter+1 == qc->n_elem?
> > 
> > That missing +1 would explain, why the SGE_TRM never gets set.
> 
> Thanks a lot for tracking this down.  Does changing the above code fix
> your problem?
> 
> Jens, Torsten's analysis looks correct && depending on qc state (n_iter)
> during iteration doesn't look like a good idea.  Those iterators are not
> supposed to have side effects.  Would it be difficult to implement
> sg_last() test?

This is the old ata_sg_is_last:

static inline int
ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
{
if (sg == >pad_sgent)
return 1;
if (qc->pad_len)
return 0;
if (((sg - qc->__sg) + 1) == qc->n_elem)
return 1;
return 0;
}

and the new one:

static inline int
ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
{
if (sg == >pad_sgent)
return 1;
if (qc->pad_len)
return 0;
if (qc->n_iter == qc->n_elem)
return 1; 
return 0;
}

->n_iter is how ata_qc_next_sg() walks over the sglist, I don't
understand your reference to why depending on that during iteration
would be bad?

So we could add a test for sg_last() there, but that would turn sg table
iteration into an O(N^2) operation for those drivers that use
ata_sg_is_last() with chained sg tables. I'd much rather just get rid of
ata_sg_is_last(), it's only used to mark end-of-table entries for
hardware. That logic can be performed cheaper.

Torsten, your analysis does look correct. Does it work with this simple
patch?

diff --git a/include/linux/libata.h b/include/linux/libata.h
index 2784163..0152bf4 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -947,7 +947,7 @@ ata_sg_is_last(struct scatterlist *sg, struct 
ata_queued_cmd *qc)
return 1;
if (qc->pad_len)
return 0;
-   if (qc->n_iter == qc->n_elem)
+   if ((qc->n_iter + 1) == qc->n_elem)
return 1;
return 0;
 }

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

Torsten Kaiser wrote:
>>> That missing +1 would explain, why the SGE_TRM never gets set.
>> Thanks a lot for tracking this down.  Does changing the above code fix
>> your problem?
> 
> I did not try it.
> I'm not an libata expert and while this change looks suspicios, I
> can't be 100% sure if that change was intended.
> And I did not want to experiment this deep in the code and risk
> corrupting the hole drive.

I don't think you would risk too much by changing that bit of code.
Please try it.

>>> But I'm still not understanding, how the kernel could only fail
>>> sometimes at bootup, but after that working without any visible
>>> errors? Is the sil-chip rather intelligent about detecting corrupted
>>> sglists and silently ignoring them?
>> I have no idea why it fails only sometimes.
> 
> And that is, why I'm so unsure.
> The error looks to serious to only cause random failures on one of two
> drives on bootup.
> I never had trouble with the remaining drive on the SiI-chip or both
> drives if one got killed during booting.
> 
> I'm guessing that leaving the computer powered down long enough fills
> the RAM with a special pattern that really hangs the drive, while
> normaly it would just reject the invalid data. (I have ECC-RAM, does
> this matter?)
> 
> Another guess might be that most of the time the Sil-chip correctly
> terminates after the transfer-length is reached, even if SGE_TRM is
> missing...

I have no idea either.  We'll probably need a PCI bus tracer to tell
exactly what's going on.

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

Torsten Kaiser wrote:
 That missing +1 would explain, why the SGE_TRM never gets set.
 Thanks a lot for tracking this down.  Does changing the above code fix
 your problem?
 
 I did not try it.
 I'm not an libata expert and while this change looks suspicios, I
 can't be 100% sure if that change was intended.
 And I did not want to experiment this deep in the code and risk
 corrupting the hole drive.

I don't think you would risk too much by changing that bit of code.
Please try it.

 But I'm still not understanding, how the kernel could only fail
 sometimes at bootup, but after that working without any visible
 errors? Is the sil-chip rather intelligent about detecting corrupted
 sglists and silently ignoring them?
 I have no idea why it fails only sometimes.
 
 And that is, why I'm so unsure.
 The error looks to serious to only cause random failures on one of two
 drives on bootup.
 I never had trouble with the remaining drive on the SiI-chip or both
 drives if one got killed during booting.
 
 I'm guessing that leaving the computer powered down long enough fills
 the RAM with a special pattern that really hangs the drive, while
 normaly it would just reject the invalid data. (I have ECC-RAM, does
 this matter?)
 
 Another guess might be that most of the time the Sil-chip correctly
 terminates after the transfer-length is reached, even if SGE_TRM is
 missing...

I have no idea either.  We'll probably need a PCI bus tracer to tell
exactly what's going on.

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On Thu, Oct 11 2007, Tejun Heo wrote:
 Torsten Kaiser wrote:
  Looking closer at
  http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
  the change to libata.h seems bogus :
  
  in ata_qc_first_sg:
  oldnew
  return qc-__sgreturn qc-__sg
  qc-__sg - qc-__sg == 0   qc-n_iter=0
  - sg - qc-__sg corresponds to qc-n_iter
  
  in ata_qc_next_sg:
  sg++;  sg_next(sg); qc-n_iter++;
  sg - qc-__sg  qc-n_elem qc-n_iter  qc-nelem
  - sg - qc-__sg corresponds to qc-n_iter
  
  but in ata_sg_is_last:
  (sg - qc-__sg) +1 == qc-n_elem   qc-n_iter == qc-n_elem
  if sg - qc-__sg corresponds to qc-n_iter then shoudn't it be
  qc-n_iter+1 == qc-n_elem?
  
  That missing +1 would explain, why the SGE_TRM never gets set.
 
 Thanks a lot for tracking this down.  Does changing the above code fix
 your problem?
 
 Jens, Torsten's analysis looks correct  depending on qc state (n_iter)
 during iteration doesn't look like a good idea.  Those iterators are not
 supposed to have side effects.  Would it be difficult to implement
 sg_last() test?

This is the old ata_sg_is_last:

static inline int
ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
{
if (sg == qc-pad_sgent)
return 1;
if (qc-pad_len)
return 0;
if (((sg - qc-__sg) + 1) == qc-n_elem)
return 1;
return 0;
}

and the new one:

static inline int
ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
{
if (sg == qc-pad_sgent)
return 1;
if (qc-pad_len)
return 0;
if (qc-n_iter == qc-n_elem)
return 1; 
return 0;
}

-n_iter is how ata_qc_next_sg() walks over the sglist, I don't
understand your reference to why depending on that during iteration
would be bad?

So we could add a test for sg_last() there, but that would turn sg table
iteration into an O(N^2) operation for those drivers that use
ata_sg_is_last() with chained sg tables. I'd much rather just get rid of
ata_sg_is_last(), it's only used to mark end-of-table entries for
hardware. That logic can be performed cheaper.

Torsten, your analysis does look correct. Does it work with this simple
patch?

diff --git a/include/linux/libata.h b/include/linux/libata.h
index 2784163..0152bf4 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -947,7 +947,7 @@ ata_sg_is_last(struct scatterlist *sg, struct 
ata_queued_cmd *qc)
return 1;
if (qc-pad_len)
return 0;
-   if (qc-n_iter == qc-n_elem)
+   if ((qc-n_iter + 1) == qc-n_elem)
return 1;
return 0;
 }

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

Jens Axboe wrote:
 This is the old ata_sg_is_last:
 
 static inline int
 ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
 {
 if (sg == qc-pad_sgent)
 return 1;
 if (qc-pad_len)
 return 0;
 if (((sg - qc-__sg) + 1) == qc-n_elem)
 return 1;
 return 0;
 }
 
 and the new one:
 
 static inline int
 ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
 {
 if (sg == qc-pad_sgent)
 return 1;
 if (qc-pad_len)
 return 0;
 if (qc-n_iter == qc-n_elem)
 return 1; 
 return 0;
 }
 
 -n_iter is how ata_qc_next_sg() walks over the sglist, I don't
 understand your reference to why depending on that during iteration
 would be bad?

Because that makes ata_sg iterator macros have hidden side effects
(nothing in the interface suggests it can't be nested and when somebody
actually nests it, finding what went wrong can be pretty difficult).  I
think it would be better to have explicit ata_sg_iter passed around if
sg itself isn't enough to walk the sg list.

 So we could add a test for sg_last() there, but that would turn sg table
 iteration into an O(N^2) operation for those drivers that use
 ata_sg_is_last() with chained sg tables. I'd much rather just get rid of
 ata_sg_is_last(), it's only used to mark end-of-table entries for
 hardware. That logic can be performed cheaper.

Yeap, it can be removed but having is this the last one? test is just
nicer to low level drivers.  With ata_sg_iter, I think we can do it in
O(N) by looking up and caching the next entry.

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On Thu, Oct 11 2007, Tejun Heo wrote:
 Jens Axboe wrote:
  This is the old ata_sg_is_last:
  
  static inline int
  ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
  {
  if (sg == qc-pad_sgent)
  return 1;
  if (qc-pad_len)
  return 0;
  if (((sg - qc-__sg) + 1) == qc-n_elem)
  return 1;
  return 0;
  }
  
  and the new one:
  
  static inline int
  ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
  {
  if (sg == qc-pad_sgent)
  return 1;
  if (qc-pad_len)
  return 0;
  if (qc-n_iter == qc-n_elem)
  return 1; 
  return 0;
  }
  
  -n_iter is how ata_qc_next_sg() walks over the sglist, I don't
  understand your reference to why depending on that during iteration
  would be bad?
 
 Because that makes ata_sg iterator macros have hidden side effects
 (nothing in the interface suggests it can't be nested and when somebody
 actually nests it, finding what went wrong can be pretty difficult).  I
 think it would be better to have explicit ata_sg_iter passed around if
 sg itself isn't enough to walk the sg list.

I think it's implicit in the interface, since you don't pass an sgtable
in. But it's not a big deal to me, if you want it changed, go ahead :-)

  So we could add a test for sg_last() there, but that would turn sg table
  iteration into an O(N^2) operation for those drivers that use
  ata_sg_is_last() with chained sg tables. I'd much rather just get rid of
  ata_sg_is_last(), it's only used to mark end-of-table entries for
  hardware. That logic can be performed cheaper.
 
 Yeap, it can be removed but having is this the last one? test is just
 nicer to low level drivers.  With ata_sg_iter, I think we can do it in
 O(N) by looking up and caching the next entry.

Sure, of you could just lookup sg_last() in the beginning of the loop.
Still seems a bit silly to me just to keep the ata_sg_is_last()
interface, since it can be done for free basically by maintaining an
sglast element in the loop that you then use when the loop is done to
set your SG_END marker (or whatever the driver needs).

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-10-11 Thread Torsten Kaiser

On 10/11/07, Tejun Heo [EMAIL PROTECTED] wrote:
 Torsten Kaiser wrote:
  That missing +1 would explain, why the SGE_TRM never gets set.
  Thanks a lot for tracking this down.  Does changing the above code fix
  your problem?
 
  I did not try it.
  I'm not an libata expert and while this change looks suspicios, I
  can't be 100% sure if that change was intended.
  And I did not want to experiment this deep in the code and risk
  corrupting the hole drive.

 I don't think you would risk too much by changing that bit of code.
 Please try it.

The debug output now shows that SGE_TRM is getting set.
I will try to reproduce the error tomorrow morning, and report if I'm
still seeing it.

Reporting that I'm no longer see it is sadly not a good indicator, as
it also seems to depend on other factors like memory layout and/or
code alignment.

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-10-10 Thread Torsten Kaiser

On 10/11/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Torsten Kaiser wrote:
> > Looking closer at
> > http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
> > the change to libata.h seems bogus :
> >
> > in ata_qc_first_sg:
> > oldnew
> > return qc->__sgreturn qc->__sg
> > qc->__sg - qc->__sg == 0   qc->n_iter=0
> > -> sg - qc->__sg corresponds to qc->n_iter
> >
> > in ata_qc_next_sg:
> > sg++;  sg_next(sg); qc->n_iter++;
> > sg - qc->__sg < qc->n_elem qc->n_iter < qc->nelem
> > -> sg - qc->__sg corresponds to qc->n_iter
> >
> > but in ata_sg_is_last:
> > (sg - qc->__sg) +1 == qc->n_elem   qc->n_iter == qc->n_elem
> > if sg - qc->__sg corresponds to qc->n_iter then shoudn't it be
> > qc->n_iter+1 == qc->n_elem?
> >
> > That missing +1 would explain, why the SGE_TRM never gets set.
>
> Thanks a lot for tracking this down.  Does changing the above code fix
> your problem?

I did not try it.
I'm not an libata expert and while this change looks suspicios, I
can't be 100% sure if that change was intended.
And I did not want to experiment this deep in the code and risk
corrupting the hole drive.

> Jens, Torsten's analysis looks correct && depending on qc state (n_iter)
> during iteration doesn't look like a good idea.  Those iterators are not
> supposed to have side effects.  Would it be difficult to implement
> sg_last() test?
>
> > And it would fit the symptoms, that the boot would fail at random. If
> > the "correct" garbage was in place to where the sglist runs off it
> > hangs the drive.
> > And that would even fit the two different errors that I only got one time 
> > each:
> > * a completely illegal access (PCI master abort while fetching SGT)
> > * wrong alignment of the SGT (SGT no on qword boundary)
> > At that that times the garbage seemed to point invalid addresses.
> >
> > But I'm still not understanding, how the kernel could only fail
> > sometimes at bootup, but after that working without any visible
> > errors? Is the sil-chip rather intelligent about detecting corrupted
> > sglists and silently ignoring them?
>
> I have no idea why it fails only sometimes.

And that is, why I'm so unsure.
The error looks to serious to only cause random failures on one of two
drives on bootup.
I never had trouble with the remaining drive on the SiI-chip or both
drives if one got killed during booting.

I'm guessing that leaving the computer powered down long enough fills
the RAM with a special pattern that really hangs the drive, while
normaly it would just reject the invalid data. (I have ECC-RAM, does
this matter?)

Another guess might be that most of the time the Sil-chip correctly
terminates after the transfer-length is reached, even if SGE_TRM is
missing...

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-10-10 Thread Tejun Heo

Torsten Kaiser wrote:
> Looking closer at
> http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
> the change to libata.h seems bogus :
> 
> in ata_qc_first_sg:
> oldnew
> return qc->__sgreturn qc->__sg
> qc->__sg - qc->__sg == 0   qc->n_iter=0
> -> sg - qc->__sg corresponds to qc->n_iter
> 
> in ata_qc_next_sg:
> sg++;  sg_next(sg); qc->n_iter++;
> sg - qc->__sg < qc->n_elem qc->n_iter < qc->nelem
> -> sg - qc->__sg corresponds to qc->n_iter
> 
> but in ata_sg_is_last:
> (sg - qc->__sg) +1 == qc->n_elem   qc->n_iter == qc->n_elem
> if sg - qc->__sg corresponds to qc->n_iter then shoudn't it be
> qc->n_iter+1 == qc->n_elem?
> 
> That missing +1 would explain, why the SGE_TRM never gets set.

Thanks a lot for tracking this down.  Does changing the above code fix
your problem?

Jens, Torsten's analysis looks correct && depending on qc state (n_iter)
during iteration doesn't look like a good idea.  Those iterators are not
supposed to have side effects.  Would it be difficult to implement
sg_last() test?

> And it would fit the symptoms, that the boot would fail at random. If
> the "correct" garbage was in place to where the sglist runs off it
> hangs the drive.
> And that would even fit the two different errors that I only got one time 
> each:
> * a completely illegal access (PCI master abort while fetching SGT)
> * wrong alignment of the SGT (SGT no on qword boundary)
> At that that times the garbage seemed to point invalid addresses.
> 
> But I'm still not understanding, how the kernel could only fail
> sometimes at bootup, but after that working without any visible
> errors? Is the sil-chip rather intelligent about detecting corrupted
> sglists and silently ignoring them?

I have no idea why it fails only sometimes.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-10-10 Thread Tejun Heo

Torsten Kaiser wrote:
 Looking closer at
 http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
 the change to libata.h seems bogus :
 
 in ata_qc_first_sg:
 oldnew
 return qc-__sgreturn qc-__sg
 qc-__sg - qc-__sg == 0   qc-n_iter=0
 - sg - qc-__sg corresponds to qc-n_iter
 
 in ata_qc_next_sg:
 sg++;  sg_next(sg); qc-n_iter++;
 sg - qc-__sg  qc-n_elem qc-n_iter  qc-nelem
 - sg - qc-__sg corresponds to qc-n_iter
 
 but in ata_sg_is_last:
 (sg - qc-__sg) +1 == qc-n_elem   qc-n_iter == qc-n_elem
 if sg - qc-__sg corresponds to qc-n_iter then shoudn't it be
 qc-n_iter+1 == qc-n_elem?
 
 That missing +1 would explain, why the SGE_TRM never gets set.

Thanks a lot for tracking this down.  Does changing the above code fix
your problem?

Jens, Torsten's analysis looks correct  depending on qc state (n_iter)
during iteration doesn't look like a good idea.  Those iterators are not
supposed to have side effects.  Would it be difficult to implement
sg_last() test?

 And it would fit the symptoms, that the boot would fail at random. If
 the correct garbage was in place to where the sglist runs off it
 hangs the drive.
 And that would even fit the two different errors that I only got one time 
 each:
 * a completely illegal access (PCI master abort while fetching SGT)
 * wrong alignment of the SGT (SGT no on qword boundary)
 At that that times the garbage seemed to point invalid addresses.
 
 But I'm still not understanding, how the kernel could only fail
 sometimes at bootup, but after that working without any visible
 errors? Is the sil-chip rather intelligent about detecting corrupted
 sglists and silently ignoring them?

I have no idea why it fails only sometimes.

-- 
tejun
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-10-10 Thread Torsten Kaiser

On 10/11/07, Tejun Heo [EMAIL PROTECTED] wrote:
Torsten Kaiser wrote:
Looking closer at
http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
the change to libata.h seems bogus :

in ata_qc_first_sg:
oldnew
return qc-__sgreturn qc-__sg
qc-__sg - qc-__sg == 0 qc-n_iter=0
- sg - qc-__sg corresponds to qc-n_iter

in ata_qc_next_sg:
sg++; sg_next(sg); qc-n_iter++;
sg - qc-__sg qc-n_elem qc-n_iter qc-nelem
- sg - qc-__sg corresponds to qc-n_iter

but in ata_sg_is_last:
(sg - qc-__sg) +1 == qc-n_elem qc-n_iter == qc-n_elem
if sg - qc-__sg corresponds to qc-n_iter then shoudn't it be
qc-n_iter+1 == qc-n_elem?

That missing +1 would explain, why the SGE_TRM never gets set.

Thanks a lot for tracking this down. Does changing the above code fix
your problem?

I did not try it.
I'm not an libata expert and while this change looks suspicios, I
can't be 100% sure if that change was intended.
And I did not want to experiment this deep in the code and risk
corrupting the hole drive.

Jens, Torsten's analysis looks correct depending on qc state (n_iter)
during iteration doesn't look like a good idea. Those iterators are not
supposed to have side effects. Would it be difficult to implement
sg_last() test?

And it would fit the symptoms, that the boot would fail at random. If
the correct garbage was in place to where the sglist runs off it
hangs the drive.
And that would even fit the two different errors that I only got one time
each:
* a completely illegal access (PCI master abort while fetching SGT)
* wrong alignment of the SGT (SGT no on qword boundary)
At that that times the garbage seemed to point invalid addresses.

But I'm still not understanding, how the kernel could only fail
sometimes at bootup, but after that working without any visible
errors? Is the sil-chip rather intelligent about detecting corrupted
sglists and silently ignoring them?

I have no idea why it fails only sometimes.

And that is, why I'm so unsure.
The error looks to serious to only cause random failures on one of two
drives on bootup.
I never had trouble with the remaining drive on the SiI-chip or both
drives if one got killed during booting.

I'm guessing that leaving the computer powered down long enough fills
the RAM with a special pattern that really hangs the drive, while
normaly it would just reject the invalid data. (I have ECC-RAM, does
this matter?)

Another guess might be that most of the time the Sil-chip correctly
terminates after the transfer-length is reached, even if SGE_TRM is
missing...

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

[Adding Jens Axboe, the author of what looks like the probable cause]
On 10/7/07, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> My sil24_fill_sg now looks like this:
> static inline void sil24_fill_sg(struct ata_queued_cmd *qc,
>  struct sil24_sge *sge)
> {
> struct scatterlist *sg;
>
> ata_for_each_sg(sg, qc) {
> sge->addr = cpu_to_le64(sg_dma_address(sg));
> sge->cnt = cpu_to_le32(sg_dma_len(sg));
> if (ata_sg_is_last(sg, qc))
> sge->flags = cpu_to_le32(SGE_TRM);
> else
> sge->flags = 0;
> DPRINTK("flags,addr,cnt = 0x%x, 0x%X, 0x%X\n", sge->flags,
> sge->addr, sge->cnt);
> sge++;
> }
> }
>
> Suspicious is, that *all* output from this DPRINTK shows flags as 0x0,
> so the last sg is never terminated (SGE_TRM is 1<<31)?
> But if that is the cause, how is this working at all? Or am I doing
> something stupid?

Looking closer at
http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
the change to libata.h seems bogus :

in ata_qc_first_sg:
oldnew
return qc->__sgreturn qc->__sg
qc->__sg - qc->__sg == 0   qc->n_iter=0
-> sg - qc->__sg corresponds to qc->n_iter

in ata_qc_next_sg:
sg++;  sg_next(sg); qc->n_iter++;
sg - qc->__sg < qc->n_elem qc->n_iter < qc->nelem
-> sg - qc->__sg corresponds to qc->n_iter

but in ata_sg_is_last:
(sg - qc->__sg) +1 == qc->n_elem   qc->n_iter == qc->n_elem
if sg - qc->__sg corresponds to qc->n_iter then shoudn't it be
qc->n_iter+1 == qc->n_elem?

That missing +1 would explain, why the SGE_TRM never gets set.

And it would fit the symptoms, that the boot would fail at random. If
the "correct" garbage was in place to where the sglist runs off it
hangs the drive.
And that would even fit the two different errors that I only got one time each:
* a completely illegal access (PCI master abort while fetching SGT)
* wrong alignment of the SGT (SGT no on qword boundary)
At that that times the garbage seemed to point invalid addresses.

But I'm still not understanding, how the kernel could only fail
sometimes at bootup, but after that working without any visible
errors? Is the sil-chip rather intelligent about detecting corrupted
sglists and silently ignoring them?

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/5/07, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> So I will use the weekend to see if I can find out who issues this
> command and add more debug to that place...

I added some DPRINTK to sil24_qc_issue and sil24_fill_sg, but I only
found one suspicious thing.

My sil24_fill_sg now looks like this:
static inline void sil24_fill_sg(struct ata_queued_cmd *qc,
 struct sil24_sge *sge)
{
struct scatterlist *sg;

ata_for_each_sg(sg, qc) {
sge->addr = cpu_to_le64(sg_dma_address(sg));
sge->cnt = cpu_to_le32(sg_dma_len(sg));
if (ata_sg_is_last(sg, qc))
sge->flags = cpu_to_le32(SGE_TRM);
else
sge->flags = 0;
DPRINTK("flags,addr,cnt = 0x%x, 0x%X, 0x%X\n", sge->flags,
sge->addr, sge->cnt);
sge++;
}
}

Suspicious is, that *all* output from this DPRINTK shows flags as 0x0,
so the last sg is never terminated (SGE_TRM is 1<<31)?
But if that is the cause, how is this working at all? Or am I doing
something stupid?

Timing and outputs from five boots:
good:bad:
  more morebootmore
3->35 3->353->35 3->35 3->35
3->2a 2->352->35 3->2a 3->2a
3->setup  2->2a2->2a 3->setup  3->setup
2->35 2->352->35 2->35 2->35
1->35 3->2a3->2a 1->35 1->35
2->2a 3->setup 3->setup  2->2a 2->2a
1->2a 1->351->35 1->2a 1->2a
2->35 1->2a1->2a 2->35
1->35 1->351->35   1->35
3->int3->int   3->int3->int3->int
3->35 3->353->35 3->35 3->35
  1->5DF/1439C 1->5DC/1439C1->5DE/1439C
  2->5E0/143BC 2->5DE/143BC2->5DF/143BC
  sg:170E  sg:1AAB sg:1A60
XXX:
5DD   5DF  5DC   5DF   5DE
5E0   5E0  5DE   5E0   5DF

The first three columns where working tries, the last two failed one drive.
column 1: ATA_DEBUG added, reboot
column 2: +my additions, reboot
column 3: +my additions, cold boot, wanted to make it fail, but worked
column 4: ATA_DEBUG added, cold boot
column 5: +my additions, cold boot
[x]->[y]: x is the ata-port, 1+2 on the sata_sil24, 3 on sata_nv with swncq
y:35 -> SYNCHRONIZE_CACHE commands that where send to the drive
y:2a -> WRITE_10 commands that where send to the drive
y:setup -> Debug from swncq: nv_swncq_dmafis: dma setup tag 0x0
y:int -> Debug from swncq: nv_swncq_host_interrupt: id 0x3 SWNCQ:
qc_active 0x1 ...

The lines before the XXX:
x->a/b: x is the ata-port, a the paddr from sil24_qc_issue, b the
activate from sil24_qc_issue
All outputs from sil24_qc_issue where identical in each boot sequence,
only differed from run to run.
sg:a: a is the sge->addr from sil24_fill_sg

The lines after the XXX:
This are the addresses that the XXX-printk from sil24_port_start prints.

I hope I explained enough what above table should mean.
This hole sequence (two syncs and one write to each drive) happens
between the output:
[   40.30] md1: bitmap initialized from disk: read 10/10 pages, set 87 bits
[   40.32] created bitmap (145 pages) for device md1
and the error on a bad boot:
[   70.68] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[   70.70] ata2.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
or if on a good boot:
[   40.91] md: considering sdb1 ...
(sdb1 is part of another raid)

(If someone whats to complete bootlogs, just ask)

So now I have two questions:
1) What happens in sil24_fill_sg with SGE_TRM?
2) If that is ok, should I try to add debug to sil24_error_intr and/or
sil24_host_intr?

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/5/07, Torsten Kaiser [EMAIL PROTECTED] wrote:
 So I will use the weekend to see if I can find out who issues this
 command and add more debug to that place...

I added some DPRINTK to sil24_qc_issue and sil24_fill_sg, but I only
found one suspicious thing.

My sil24_fill_sg now looks like this:
static inline void sil24_fill_sg(struct ata_queued_cmd *qc,
 struct sil24_sge *sge)
{
struct scatterlist *sg;

ata_for_each_sg(sg, qc) {
sge-addr = cpu_to_le64(sg_dma_address(sg));
sge-cnt = cpu_to_le32(sg_dma_len(sg));
if (ata_sg_is_last(sg, qc))
sge-flags = cpu_to_le32(SGE_TRM);
else
sge-flags = 0;
DPRINTK(flags,addr,cnt = 0x%x, 0x%X, 0x%X\n, sge-flags,
sge-addr, sge-cnt);
sge++;
}
}

Suspicious is, that *all* output from this DPRINTK shows flags as 0x0,
so the last sg is never terminated (SGE_TRM is 131)?
But if that is the cause, how is this working at all? Or am I doing
something stupid?

Timing and outputs from five boots:
good:bad:
  more morebootmore
3-35 3-353-35 3-35 3-35
3-2a 2-352-35 3-2a 3-2a
3-setup  2-2a2-2a 3-setup  3-setup
2-35 2-352-35 2-35 2-35
1-35 3-2a3-2a 1-35 1-35
2-2a 3-setup 3-setup  2-2a 2-2a
1-2a 1-351-35 1-2a 1-2a
2-35 1-2a1-2a 2-35
1-35 1-351-35   1-35
3-int3-int   3-int3-int3-int
3-35 3-353-35 3-35 3-35
  1-5DF/1439C 1-5DC/1439C1-5DE/1439C
  2-5E0/143BC 2-5DE/143BC2-5DF/143BC
  sg:170E  sg:1AAB sg:1A60
XXX:
5DD   5DF  5DC   5DF   5DE
5E0   5E0  5DE   5E0   5DF

The first three columns where working tries, the last two failed one drive.
column 1: ATA_DEBUG added, reboot
column 2: +my additions, reboot
column 3: +my additions, cold boot, wanted to make it fail, but worked
column 4: ATA_DEBUG added, cold boot
column 5: +my additions, cold boot
[x]-[y]: x is the ata-port, 1+2 on the sata_sil24, 3 on sata_nv with swncq
y:35 - SYNCHRONIZE_CACHE commands that where send to the drive
y:2a - WRITE_10 commands that where send to the drive
y:setup - Debug from swncq: nv_swncq_dmafis: dma setup tag 0x0
y:int - Debug from swncq: nv_swncq_host_interrupt: id 0x3 SWNCQ:
qc_active 0x1 ...

The lines before the XXX:
x-a/b: x is the ata-port, a the paddr from sil24_qc_issue, b the
activate from sil24_qc_issue
All outputs from sil24_qc_issue where identical in each boot sequence,
only differed from run to run.
sg:a: a is the sge-addr from sil24_fill_sg

The lines after the XXX:
This are the addresses that the XXX-printk from sil24_port_start prints.

I hope I explained enough what above table should mean.
This hole sequence (two syncs and one write to each drive) happens
between the output:
[   40.30] md1: bitmap initialized from disk: read 10/10 pages, set 87 bits
[   40.32] created bitmap (145 pages) for device md1
and the error on a bad boot:
[   70.68] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[   70.70] ata2.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
or if on a good boot:
[   40.91] md: considering sdb1 ...
(sdb1 is part of another raid)

(If someone whats to complete bootlogs, just ask)

So now I have two questions:
1) What happens in sil24_fill_sg with SGE_TRM?
2) If that is ok, should I try to add debug to sil24_error_intr and/or
sil24_host_intr?

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

[Adding Jens Axboe, the author of what looks like the probable cause]
On 10/7/07, Torsten Kaiser [EMAIL PROTECTED] wrote:
 My sil24_fill_sg now looks like this:
 static inline void sil24_fill_sg(struct ata_queued_cmd *qc,
  struct sil24_sge *sge)
 {
 struct scatterlist *sg;

 ata_for_each_sg(sg, qc) {
 sge-addr = cpu_to_le64(sg_dma_address(sg));
 sge-cnt = cpu_to_le32(sg_dma_len(sg));
 if (ata_sg_is_last(sg, qc))
 sge-flags = cpu_to_le32(SGE_TRM);
 else
 sge-flags = 0;
 DPRINTK(flags,addr,cnt = 0x%x, 0x%X, 0x%X\n, sge-flags,
 sge-addr, sge-cnt);
 sge++;
 }
 }

 Suspicious is, that *all* output from this DPRINTK shows flags as 0x0,
 so the last sg is never terminated (SGE_TRM is 131)?
 But if that is the cause, how is this working at all? Or am I doing
 something stupid?

Looking closer at
http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
the change to libata.h seems bogus :

in ata_qc_first_sg:
oldnew
return qc-__sgreturn qc-__sg
qc-__sg - qc-__sg == 0   qc-n_iter=0
- sg - qc-__sg corresponds to qc-n_iter

in ata_qc_next_sg:
sg++;  sg_next(sg); qc-n_iter++;
sg - qc-__sg  qc-n_elem qc-n_iter  qc-nelem
- sg - qc-__sg corresponds to qc-n_iter

but in ata_sg_is_last:
(sg - qc-__sg) +1 == qc-n_elem   qc-n_iter == qc-n_elem
if sg - qc-__sg corresponds to qc-n_iter then shoudn't it be
qc-n_iter+1 == qc-n_elem?

That missing +1 would explain, why the SGE_TRM never gets set.

And it would fit the symptoms, that the boot would fail at random. If
the correct garbage was in place to where the sglist runs off it
hangs the drive.
And that would even fit the two different errors that I only got one time each:
* a completely illegal access (PCI master abort while fetching SGT)
* wrong alignment of the SGT (SGT no on qword boundary)
At that that times the garbage seemed to point invalid addresses.

But I'm still not understanding, how the kernel could only fail
sometimes at bootup, but after that working without any visible
errors? Is the sil-chip rather intelligent about detecting corrupted
sglists and silently ignoring them?

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-10-05 Thread Torsten Kaiser

On 10/4/07, Matt Mackall <[EMAIL PROTECTED]> wrote:
> On Thu, Oct 04, 2007 at 07:32:52AM +0200, Torsten Kaiser wrote:
> > So now I'm rather out of ideas what to test... :(
>
> I'd give your previous bisect step another try.

Yes, I thought about that too. But I never seemed to need more than
two tries to make it fail.
So I would only suspect the last good step as wrong positive.
That would then point to the first of your maps2-patches, the moving
of the pagewalker code.
Would you thing that this is a plausible cause?

> Looking back at the thread a bit, anything that requires the machine
> to be off for more than a couple seconds to manifest stops looking
> like software and firmware and starts looking like a heat-related
> electrical or mechanical issue. Make sure your backups are current.

What backups? :-)

Yes, I also thought about hardware trouble, but the bisect result
seemed to consistent.
Also that its not always the same drive that fails, only every time
one of the sil-drives.

I now have activated ATA_DEBUG to see if the good and the bad boots differ.
It looks the same until the RAID5 starts.

Good boot:
[   40.16] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.16] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.16] ata_scsi_dump_cdb: CDB (2:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.16] ata_sg_setup: 1 sg elements mapped
[   40.16] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.16] ata_sg_setup: 1 sg elements mapped
[   40.16] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.16] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.32] nv_swncq_host_interrupt: id 0x3 SWNCQ: qc_active 0x1
dhfis 0x1 dmafis 0x1 sactive 0x0
[   40.32] nv_swncq_sdbfis: over
[   40.32] ata_scsi_dump_cdb: CDB (3:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.32] ata_exec_command: ata3: cmd 0xEA
[   40.39] ata_hsm_move: ata3: protocol 1 task_state 3 (dev_stat 0x40)
[   40.39] ata_hsm_move: ata3: dev 0 command complete, drv_stat 0x40
[   40.42] md: considering sdb1 ...
[   40.44] md:  adding sdb1 ...
[   40.44] md:  adding sda1 ...
[   40.45] md: created md0
[   40.46] md: bind
[   40.47] md: bind
[   40.48] md: running: 
[   40.50] raid1: raid set md0 active with 2 out of 2 mirrors

Bad boot:
[   40.06] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.06] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.06] ata_scsi_dump_cdb: CDB (2:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.06] ata_sg_setup: 1 sg elements mapped
[   40.06] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.06] ata_sg_setup: 1 sg elements mapped
[   40.06] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.20] nv_swncq_host_interrupt: id 0x3 SWNCQ: qc_active 0x1
dhfis 0x1 dmafis 0x1 sactive 0x0
[   40.20] nv_swncq_sdbfis: over
[   40.20] ata_scsi_dump_cdb: CDB (3:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.20] ata_exec_command: ata3: cmd 0xEA
[   40.27] ata_hsm_move: ata3: protocol 1 task_state 3 (dev_stat 0x40)
[   40.27] ata_hsm_move: ata3: dev 0 command complete, drv_stat 0x40
[   70.06] ata_scsi_timed_out: ENTER
[   70.06] ata_scsi_timed_out: EXIT, ret=0
[   70.08] ata_scsi_error: ENTER
[   70.08] ata_port_flush_task: ENTER
[   70.10] ata1: ata_port_flush_task: EXIT
[   70.11] __ata_port_freeze: ata1 port frozen
[   70.22] __ata_port_freeze: ata1 port frozen
[   70.23] ata_eh_link_autopsy: ENTER
[   70.24] ata_eh_link_autopsy: EXIT
[   70.25] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[   70.27] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[   70.27]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask
0x4 (timeout)

After [   40.06] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
the drive sda falls of the earth and can't be recovered through soft-
or hard-resetting the port by the error handler.

So I will use the weekend to see if I can find out who issues this
command and add more debug to that place...

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-10-05 Thread Torsten Kaiser

On 10/4/07, Matt Mackall [EMAIL PROTECTED] wrote:
 On Thu, Oct 04, 2007 at 07:32:52AM +0200, Torsten Kaiser wrote:
  So now I'm rather out of ideas what to test... :(

 I'd give your previous bisect step another try.

Yes, I thought about that too. But I never seemed to need more than
two tries to make it fail.
So I would only suspect the last good step as wrong positive.
That would then point to the first of your maps2-patches, the moving
of the pagewalker code.
Would you thing that this is a plausible cause?

 Looking back at the thread a bit, anything that requires the machine
 to be off for more than a couple seconds to manifest stops looking
 like software and firmware and starts looking like a heat-related
 electrical or mechanical issue. Make sure your backups are current.

What backups? :-)

Yes, I also thought about hardware trouble, but the bisect result
seemed to consistent.
Also that its not always the same drive that fails, only every time
one of the sil-drives.

I now have activated ATA_DEBUG to see if the good and the bad boots differ.
It looks the same until the RAID5 starts.

Good boot:
[   40.16] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.16] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.16] ata_scsi_dump_cdb: CDB (2:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.16] ata_sg_setup: 1 sg elements mapped
[   40.16] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.16] ata_sg_setup: 1 sg elements mapped
[   40.16] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.16] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.32] nv_swncq_host_interrupt: id 0x3 SWNCQ: qc_active 0x1
dhfis 0x1 dmafis 0x1 sactive 0x0
[   40.32] nv_swncq_sdbfis: over
[   40.32] ata_scsi_dump_cdb: CDB (3:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.32] ata_exec_command: ata3: cmd 0xEA
[   40.39] ata_hsm_move: ata3: protocol 1 task_state 3 (dev_stat 0x40)
[   40.39] ata_hsm_move: ata3: dev 0 command complete, drv_stat 0x40
[   40.42] md: considering sdb1 ...
[   40.44] md:  adding sdb1 ...
[   40.44] md:  adding sda1 ...
[   40.45] md: created md0
[   40.46] md: bindsda1
[   40.47] md: bindsdb1
[   40.48] md: running: sdb1sda1
[   40.50] raid1: raid set md0 active with 2 out of 2 mirrors

Bad boot:
[   40.06] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.06] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.06] ata_scsi_dump_cdb: CDB (2:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.06] ata_sg_setup: 1 sg elements mapped
[   40.06] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.06] ata_sg_setup: 1 sg elements mapped
[   40.06] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.20] nv_swncq_host_interrupt: id 0x3 SWNCQ: qc_active 0x1
dhfis 0x1 dmafis 0x1 sactive 0x0
[   40.20] nv_swncq_sdbfis: over
[   40.20] ata_scsi_dump_cdb: CDB (3:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.20] ata_exec_command: ata3: cmd 0xEA
[   40.27] ata_hsm_move: ata3: protocol 1 task_state 3 (dev_stat 0x40)
[   40.27] ata_hsm_move: ata3: dev 0 command complete, drv_stat 0x40
[   70.06] ata_scsi_timed_out: ENTER
[   70.06] ata_scsi_timed_out: EXIT, ret=0
[   70.08] ata_scsi_error: ENTER
[   70.08] ata_port_flush_task: ENTER
[   70.10] ata1: ata_port_flush_task: EXIT
[   70.11] __ata_port_freeze: ata1 port frozen
[   70.22] __ata_port_freeze: ata1 port frozen
[   70.23] ata_eh_link_autopsy: ENTER
[   70.24] ata_eh_link_autopsy: EXIT
[   70.25] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[   70.27] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[   70.27]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask
0x4 (timeout)

After [   40.06] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
the drive sda falls of the earth and can't be recovered through soft-
or hard-resetting the port by the error handler.

So I will use the weekend to see if I can find out who issues this
command and add more debug to that place...

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-10-04 Thread Matt Mackall

On Thu, Oct 04, 2007 at 07:32:52AM +0200, Torsten Kaiser wrote:
> On 10/3/07, Matt Mackall <[EMAIL PROTECTED]> wrote:
> > Well I can see no reason why the vma we just got to by the mm->mmap
> > would have a vm_mm != mm, but I've certainly been wrong before.
> >
> > Try changing it to:
> >
> > for (vma = mm->mmap; vma; vma = vma->vm_next)
> > if (!is_vm_hugetlb_page(vma)) {
> > if (vma->vm_mm != mm)
> > printk("WTF: vma->vm_mm %p mm %p\n",
> > vma->vm_mm, mm);
> > walk_page_range(vma->vm_mm, vma->vm_start, 
> > vma->vm_end,
> > _refs_walk, vma);
> > }
> 
> You were right.
> I was able to trigger the error with above printk added, but nothing
> was written to the syslog.
> 
> So now I'm rather out of ideas what to test... :(

I'd give your previous bisect step another try.

Looking back at the thread a bit, anything that requires the machine
to be off for more than a couple seconds to manifest stops looking
like software and firmware and starts looking like a heat-related
electrical or mechanical issue. Make sure your backups are current.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-10-04 Thread Matt Mackall

On Thu, Oct 04, 2007 at 07:32:52AM +0200, Torsten Kaiser wrote:
 On 10/3/07, Matt Mackall [EMAIL PROTECTED] wrote:
  Well I can see no reason why the vma we just got to by the mm-mmap
  would have a vm_mm != mm, but I've certainly been wrong before.
 
  Try changing it to:
 
  for (vma = mm-mmap; vma; vma = vma-vm_next)
  if (!is_vm_hugetlb_page(vma)) {
  if (vma-vm_mm != mm)
  printk(WTF: vma-vm_mm %p mm %p\n,
  vma-vm_mm, mm);
  walk_page_range(vma-vm_mm, vma-vm_start, 
  vma-vm_end,
  clear_refs_walk, vma);
  }
 
 You were right.
 I was able to trigger the error with above printk added, but nothing
 was written to the syslog.
 
 So now I'm rather out of ideas what to test... :(

I'd give your previous bisect step another try.

Looking back at the thread a bit, anything that requires the machine
to be off for more than a couple seconds to manifest stops looking
like software and firmware and starts looking like a heat-related
electrical or mechanical issue. Make sure your backups are current.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/3/07, Matt Mackall <[EMAIL PROTECTED]> wrote:
> Well I can see no reason why the vma we just got to by the mm->mmap
> would have a vm_mm != mm, but I've certainly been wrong before.
>
> Try changing it to:
>
> for (vma = mm->mmap; vma; vma = vma->vm_next)
> if (!is_vm_hugetlb_page(vma)) {
> if (vma->vm_mm != mm)
> printk("WTF: vma->vm_mm %p mm %p\n",
> vma->vm_mm, mm);
> walk_page_range(vma->vm_mm, vma->vm_start, 
> vma->vm_end,
> _refs_walk, vma);
> }

You were right.
I was able to trigger the error with above printk added, but nothing
was written to the syslog.

So now I'm rather out of ideas what to test... :(

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/3/07, Matt Mackall <[EMAIL PROTECTED]> wrote:
> On Wed, Oct 03, 2007 at 07:36:55PM +0200, Torsten Kaiser wrote:
> > Of note might be, that at the time of this error init has not been
> > started. I'm using a program from initramfs to start the RAID.
> > The initramfs was primarily build using the genkernel package from
> > gentoo, but I have removed some parts of it (kernel modules and some
> > other part I no longer remember)
>
> Then I really don't see how these patches can have anything to do with
> your problem. They only affect some currently obscure files in /proc
> and it's unlikely to have any impact at init time. It's more likely
> that there's some code alignment Heisenbug or you got a false negative
> somewhere in your bisection.

Heisenbug via code alignment might be, but the bisection result seemed
rather reliable.
I have 4 good boots with all mm-patches before the maps2-* ones and 2
out of 3 boots failed with part or all maps2-* patches applied. (Any
many more failed boots with more mm-patches applied.)

But I do think it might be very possible that some (maybe even
correct) change in the behavior of this code triggers the error. So
I'm very interested in what you additional printk outputs to track
down what changed and why this breaks.

I looked at the scripts from the initramfs, but sadly I don't see
anything that crawls throug /proc and might hit your files. Other than
that there is nearly only busybox in there. cryptsetup should run
later and modprobe does not have any modules it could try to load...

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On Wed, Oct 03, 2007 at 07:36:55PM +0200, Torsten Kaiser wrote:
> On 10/3/07, Matt Mackall <[EMAIL PROTECTED]> wrote:
> > On Wed, Oct 03, 2007 at 05:55:10PM +0200, Torsten Kaiser wrote:
> > > This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
> > > its code to a new function. But during the move the main for-loop from
> > > clear_refs_smap was changed:
> > >
> > > old:
> > >   for (vma = mm->mmap; vma; vma = vma->vm_next)
> > >   if (vma->vm_mm && !is_vm_hugetlb_page(vma))
> > >   walk_page_range(vma->vm_mm, vma->vm_start, 
> > > vma->vm_end,
> > >   _refs_walk, vma);
> > >
> > > new:
> > >   for (vma = mm->mmap; vma; vma = vma->vm_next)
> > >   if (!is_vm_hugetlb_page(vma))
> > >   walk_page_range(mm, vma->vm_start, vma->vm_end,
> > >   _refs_walk, vma);
> > >
> > > The walk_page_range() is no longer called on vma->vm_mm, but on mm 
> > > directly.
> > > I don't know how this can kill the sata_sil24-driver, but at least it
> > > looks suspicious.
> >
> > That code should be fine. Further, it's pretty unlikely that this code
> > ever gets invoked. This whole interface was only recently added by
> > Google folks and its usage is pretty obscure.
> >
> > Oh wait - you're _at_ Google, aren't you? Perhaps you're actually
> > using clear_refs.
> 
> No. I'm just using Google Mail.

Ahh, I see. Wasn't aware of the trademark dispute..

> Of note might be, that at the time of this error init has not been
> started. I'm using a program from initramfs to start the RAID.
> The initramfs was primarily build using the genkernel package from
> gentoo, but I have removed some parts of it (kernel modules and some
> other part I no longer remember)

Then I really don't see how these patches can have anything to do with
your problem. They only affect some currently obscure files in /proc
and it's unlikely to have any impact at init time. It's more likely
that there's some code alignment Heisenbug or you got a false negative
somewhere in your bisection.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/3/07, Matt Mackall <[EMAIL PROTECTED]> wrote:
> On Wed, Oct 03, 2007 at 05:55:10PM +0200, Torsten Kaiser wrote:
> > This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
> > its code to a new function. But during the move the main for-loop from
> > clear_refs_smap was changed:
> >
> > old:
> >   for (vma = mm->mmap; vma; vma = vma->vm_next)
> >   if (vma->vm_mm && !is_vm_hugetlb_page(vma))
> >   walk_page_range(vma->vm_mm, vma->vm_start, 
> > vma->vm_end,
> >   _refs_walk, vma);
> >
> > new:
> >   for (vma = mm->mmap; vma; vma = vma->vm_next)
> >   if (!is_vm_hugetlb_page(vma))
> >   walk_page_range(mm, vma->vm_start, vma->vm_end,
> >   _refs_walk, vma);
> >
> > The walk_page_range() is no longer called on vma->vm_mm, but on mm directly.
> > I don't know how this can kill the sata_sil24-driver, but at least it
> > looks suspicious.
>
> That code should be fine. Further, it's pretty unlikely that this code
> ever gets invoked. This whole interface was only recently added by
> Google folks and its usage is pretty obscure.
>
> Oh wait - you're _at_ Google, aren't you? Perhaps you're actually
> using clear_refs.

No. I'm just using Google Mail.

And I'm not using any additional patches apart from what is in the mm-patchset.

> Well I can see no reason why the vma we just got to by the mm->mmap
> would have a vm_mm != mm, but I've certainly been wrong before.
>
> Try changing it to:
>
> for (vma = mm->mmap; vma; vma = vma->vm_next)
> if (!is_vm_hugetlb_page(vma)) {
> if (vma->vm_mm != mm)
> printk("WTF: vma->vm_mm %p mm %p\n",
> vma->vm_mm, mm);
> walk_page_range(vma->vm_mm, vma->vm_start, 
> vma->vm_end,
> _refs_walk, vma);
> }

Will try this, but as the error only seems to trigger, if I leave the
pc powered off for more then a hour, you will need to wait until
tomorrow.

Of note might be, that at the time of this error init has not been
started. I'm using a program from initramfs to start the RAID.
The initramfs was primarily build using the genkernel package from
gentoo, but I have removed some parts of it (kernel modules and some
other part I no longer remember)

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On Wed, Oct 03, 2007 at 05:55:10PM +0200, Torsten Kaiser wrote:
> [CC added to author of the bad patch]
> 
> Short recap:
> Since 2.6.23-rc4-mm1 all mm-kernel randomly fail one of two drives on
> my Silicon Image 3132. This failure happens when my initramfs wants to
> start the RAID that is on these drives.
> 
> The first error libata throws is:
> Oct  3 16:56:46 treogen [   63.32] ata2.00: exception Emask 0x0
> SAct 0x1 SErr 0x0 action 0x6 frozen
> Oct  3 16:56:46 treogen [   63.32] ata2.00: cmd
> 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
> Oct  3 16:56:46 treogen [   63.32]  res
> 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> Oct  3 16:56:46 treogen [   63.32] ata2.00: status: {DRDY }
> 
> Resetting the sata link fails, the drive is no longer reachable until a 
> reboot.
> 
> I then bisected the mm-patches from 2.6.23-rc4-mm1 with the following result:
> 
> On 10/3/07, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> > I'm now finished with bisecting, still 2 patches, but I don't want to
> > spend another two hours waiting...
> >
> > And the winners are: (from broken-out patchset of 2.6.23-rc4-mm1)
> > maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
> > maps2-move-clear_refs-code-to-task_mmuc.patch
> >
> > Before these patches I have never seen the bug, with these I only got
> > two good boots when trying to recreate the problem. But even the
> > kernels that did one good boot failed on the second try.
> 
> The simplify-patch just seems to move some code around, but I see a
> real change in the other one:
> This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
> its code to a new function. But during the move the main for-loop from
> clear_refs_smap was changed:
> 
> old:
>   for (vma = mm->mmap; vma; vma = vma->vm_next)
>   if (vma->vm_mm && !is_vm_hugetlb_page(vma))
>   walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
>   _refs_walk, vma);
> 
> new:
>   for (vma = mm->mmap; vma; vma = vma->vm_next)
>   if (!is_vm_hugetlb_page(vma))
>   walk_page_range(mm, vma->vm_start, vma->vm_end,
>   _refs_walk, vma);
> 
> The walk_page_range() is no longer called on vma->vm_mm, but on mm directly.
> I don't know how this can kill the sata_sil24-driver, but at least it
> looks suspicious.

That code should be fine. Further, it's pretty unlikely that this code
ever gets invoked. This whole interface was only recently added by
Google folks and its usage is pretty obscure. 

Oh wait - you're _at_ Google, aren't you? Perhaps you're actually
using clear_refs.

Well I can see no reason why the vma we just got to by the mm->mmap
would have a vm_mm != mm, but I've certainly been wrong before.

Try changing it to:

for (vma = mm->mmap; vma; vma = vma->vm_next)
if (!is_vm_hugetlb_page(vma)) {
if (vma->vm_mm != mm)
printk("WTF: vma->vm_mm %p mm %p\n",
vma->vm_mm, mm);
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
_refs_walk, vma);
}

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

[CC added to author of the bad patch]

Short recap:
Since 2.6.23-rc4-mm1 all mm-kernel randomly fail one of two drives on
my Silicon Image 3132. This failure happens when my initramfs wants to
start the RAID that is on these drives.

The first error libata throws is:
Oct  3 16:56:46 treogen [   63.32] ata2.00: exception Emask 0x0
SAct 0x1 SErr 0x0 action 0x6 frozen
Oct  3 16:56:46 treogen [   63.32] ata2.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Oct  3 16:56:46 treogen [   63.32]  res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct  3 16:56:46 treogen [   63.32] ata2.00: status: {DRDY }

Resetting the sata link fails, the drive is no longer reachable until a reboot.

I then bisected the mm-patches from 2.6.23-rc4-mm1 with the following result:

On 10/3/07, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> I'm now finished with bisecting, still 2 patches, but I don't want to
> spend another two hours waiting...
>
> And the winners are: (from broken-out patchset of 2.6.23-rc4-mm1)
> maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
> maps2-move-clear_refs-code-to-task_mmuc.patch
>
> Before these patches I have never seen the bug, with these I only got
> two good boots when trying to recreate the problem. But even the
> kernels that did one good boot failed on the second try.

The simplify-patch just seems to move some code around, but I see a
real change in the other one:
This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
its code to a new function. But during the move the main for-loop from
clear_refs_smap was changed:

old:
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
_refs_walk, vma);

new:
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (!is_vm_hugetlb_page(vma))
walk_page_range(mm, vma->vm_start, vma->vm_end,
_refs_walk, vma);

The walk_page_range() is no longer called on vma->vm_mm, but on mm directly.
I don't know how this can kill the sata_sil24-driver, but at least it
looks suspicious.
As I'm not really a kernel hacker, I defer this question to the ones that are.

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/1/07, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> On 9/30/07, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> > On 9/30/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> > > Torsten Kaiser wrote:
> To make that comment "cmd part of the output was always the same" more
> clear: I did not only mean that the first (cmd-)byte was the same, but
> the whole line:
> cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
>
> I did find ATA_CMD_FPDMA_WRITE = 0x61in ata.h, but I don't find a good
> hint for the 08 that follows. The interpretation XFER_PIO_0 = 0x08
> seems wrong...

Even all of the last bisect steps still showed the same line...

> If I look at what patches remain, it seems that some other (earlier)
> patch that is new in 2.6.23-rc4-mm1 is the trigger, but it will only
> fail together with a second patch.

I'm now finished with bisecting, still 2 patches, but I don't want to
spend another two hours waiting...

And the winners are: (from broken-out patchset of 2.6.23-rc4-mm1)
maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
maps2-move-clear_refs-code-to-task_mmuc.patch

Before these patches I have never seen the bug, with these I only got
two good boots when trying to recreate the problem. But even the
kernels that did one good boot failed on the second try.

> > As for you printk:
>From the last testkernel (up to incl.
maps2-move-clear_refs-code-to-task_mmuc.patch)

Good:
Oct  3 15:50:39 treogen [3.78] XXX sil24 cb=810005dd
cb_dma=5dd
Oct  3 15:50:39 treogen [3.79] XXX sil24 cb=810005de
cb_dma=5de

Bad:
Oct  3 16:56:46 treogen [3.75] XXX sil24 cb=810005de
cb_dma=5de
Oct  3 16:56:46 treogen [3.76] XXX sil24 cb=810005df
cb_dma=5df

I have *NO* clue how these procfs changes can kill the sata_sil24
driver, but all bisect steps point to these two patches as cause or
trigger of the bug.

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/1/07, Torsten Kaiser [EMAIL PROTECTED] wrote:
 On 9/30/07, Torsten Kaiser [EMAIL PROTECTED] wrote:
  On 9/30/07, Tejun Heo [EMAIL PROTECTED] wrote:
   Torsten Kaiser wrote:
 To make that comment cmd part of the output was always the same more
 clear: I did not only mean that the first (cmd-)byte was the same, but
 the whole line:
 cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out

 I did find ATA_CMD_FPDMA_WRITE = 0x61in ata.h, but I don't find a good
 hint for the 08 that follows. The interpretation XFER_PIO_0 = 0x08
 seems wrong...

Even all of the last bisect steps still showed the same line...

 If I look at what patches remain, it seems that some other (earlier)
 patch that is new in 2.6.23-rc4-mm1 is the trigger, but it will only
 fail together with a second patch.

I'm now finished with bisecting, still 2 patches, but I don't want to
spend another two hours waiting...

And the winners are: (from broken-out patchset of 2.6.23-rc4-mm1)
maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
maps2-move-clear_refs-code-to-task_mmuc.patch

Before these patches I have never seen the bug, with these I only got
two good boots when trying to recreate the problem. But even the
kernels that did one good boot failed on the second try.

  As for you printk:
From the last testkernel (up to incl.
maps2-move-clear_refs-code-to-task_mmuc.patch)

Good:
Oct  3 15:50:39 treogen [3.78] XXX sil24 cb=810005dd
cb_dma=5dd
Oct  3 15:50:39 treogen [3.79] XXX sil24 cb=810005de
cb_dma=5de

Bad:
Oct  3 16:56:46 treogen [3.75] XXX sil24 cb=810005de
cb_dma=5de
Oct  3 16:56:46 treogen [3.76] XXX sil24 cb=810005df
cb_dma=5df

I have *NO* clue how these procfs changes can kill the sata_sil24
driver, but all bisect steps point to these two patches as cause or
trigger of the bug.

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

[CC added to author of the bad patch]

Short recap:
Since 2.6.23-rc4-mm1 all mm-kernel randomly fail one of two drives on
my Silicon Image 3132. This failure happens when my initramfs wants to
start the RAID that is on these drives.

The first error libata throws is:
Oct  3 16:56:46 treogen [   63.32] ata2.00: exception Emask 0x0
SAct 0x1 SErr 0x0 action 0x6 frozen
Oct  3 16:56:46 treogen [   63.32] ata2.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Oct  3 16:56:46 treogen [   63.32]  res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct  3 16:56:46 treogen [   63.32] ata2.00: status: {DRDY }

Resetting the sata link fails, the drive is no longer reachable until a reboot.

I then bisected the mm-patches from 2.6.23-rc4-mm1 with the following result:

On 10/3/07, Torsten Kaiser [EMAIL PROTECTED] wrote:
 I'm now finished with bisecting, still 2 patches, but I don't want to
 spend another two hours waiting...

 And the winners are: (from broken-out patchset of 2.6.23-rc4-mm1)
 maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
 maps2-move-clear_refs-code-to-task_mmuc.patch

 Before these patches I have never seen the bug, with these I only got
 two good boots when trying to recreate the problem. But even the
 kernels that did one good boot failed on the second try.

The simplify-patch just seems to move some code around, but I see a
real change in the other one:
This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
its code to a new function. But during the move the main for-loop from
clear_refs_smap was changed:

old:
for (vma = mm-mmap; vma; vma = vma-vm_next)
if (vma-vm_mm  !is_vm_hugetlb_page(vma))
walk_page_range(vma-vm_mm, vma-vm_start, vma-vm_end,
clear_refs_walk, vma);

new:
for (vma = mm-mmap; vma; vma = vma-vm_next)
if (!is_vm_hugetlb_page(vma))
walk_page_range(mm, vma-vm_start, vma-vm_end,
clear_refs_walk, vma);

The walk_page_range() is no longer called on vma-vm_mm, but on mm directly.
I don't know how this can kill the sata_sil24-driver, but at least it
looks suspicious.
As I'm not really a kernel hacker, I defer this question to the ones that are.

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On Wed, Oct 03, 2007 at 05:55:10PM +0200, Torsten Kaiser wrote:
 [CC added to author of the bad patch]
 
 Short recap:
 Since 2.6.23-rc4-mm1 all mm-kernel randomly fail one of two drives on
 my Silicon Image 3132. This failure happens when my initramfs wants to
 start the RAID that is on these drives.
 
 The first error libata throws is:
 Oct  3 16:56:46 treogen [   63.32] ata2.00: exception Emask 0x0
 SAct 0x1 SErr 0x0 action 0x6 frozen
 Oct  3 16:56:46 treogen [   63.32] ata2.00: cmd
 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
 Oct  3 16:56:46 treogen [   63.32]  res
 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
 Oct  3 16:56:46 treogen [   63.32] ata2.00: status: {DRDY }
 
 Resetting the sata link fails, the drive is no longer reachable until a 
 reboot.
 
 I then bisected the mm-patches from 2.6.23-rc4-mm1 with the following result:
 
 On 10/3/07, Torsten Kaiser [EMAIL PROTECTED] wrote:
  I'm now finished with bisecting, still 2 patches, but I don't want to
  spend another two hours waiting...
 
  And the winners are: (from broken-out patchset of 2.6.23-rc4-mm1)
  maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
  maps2-move-clear_refs-code-to-task_mmuc.patch
 
  Before these patches I have never seen the bug, with these I only got
  two good boots when trying to recreate the problem. But even the
  kernels that did one good boot failed on the second try.
 
 The simplify-patch just seems to move some code around, but I see a
 real change in the other one:
 This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
 its code to a new function. But during the move the main for-loop from
 clear_refs_smap was changed:
 
 old:
   for (vma = mm-mmap; vma; vma = vma-vm_next)
   if (vma-vm_mm  !is_vm_hugetlb_page(vma))
   walk_page_range(vma-vm_mm, vma-vm_start, vma-vm_end,
   clear_refs_walk, vma);
 
 new:
   for (vma = mm-mmap; vma; vma = vma-vm_next)
   if (!is_vm_hugetlb_page(vma))
   walk_page_range(mm, vma-vm_start, vma-vm_end,
   clear_refs_walk, vma);
 
 The walk_page_range() is no longer called on vma-vm_mm, but on mm directly.
 I don't know how this can kill the sata_sil24-driver, but at least it
 looks suspicious.

That code should be fine. Further, it's pretty unlikely that this code
ever gets invoked. This whole interface was only recently added by
Google folks and its usage is pretty obscure. 

Oh wait - you're _at_ Google, aren't you? Perhaps you're actually
using clear_refs.

Well I can see no reason why the vma we just got to by the mm-mmap
would have a vm_mm != mm, but I've certainly been wrong before.

Try changing it to:

for (vma = mm-mmap; vma; vma = vma-vm_next)
if (!is_vm_hugetlb_page(vma)) {
if (vma-vm_mm != mm)
printk(WTF: vma-vm_mm %p mm %p\n,
vma-vm_mm, mm);
walk_page_range(vma-vm_mm, vma-vm_start, vma-vm_end,
clear_refs_walk, vma);
}

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/3/07, Matt Mackall [EMAIL PROTECTED] wrote:
 On Wed, Oct 03, 2007 at 05:55:10PM +0200, Torsten Kaiser wrote:
  This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
  its code to a new function. But during the move the main for-loop from
  clear_refs_smap was changed:
 
  old:
for (vma = mm-mmap; vma; vma = vma-vm_next)
if (vma-vm_mm  !is_vm_hugetlb_page(vma))
walk_page_range(vma-vm_mm, vma-vm_start, 
  vma-vm_end,
clear_refs_walk, vma);
 
  new:
for (vma = mm-mmap; vma; vma = vma-vm_next)
if (!is_vm_hugetlb_page(vma))
walk_page_range(mm, vma-vm_start, vma-vm_end,
clear_refs_walk, vma);
 
  The walk_page_range() is no longer called on vma-vm_mm, but on mm directly.
  I don't know how this can kill the sata_sil24-driver, but at least it
  looks suspicious.

 That code should be fine. Further, it's pretty unlikely that this code
 ever gets invoked. This whole interface was only recently added by
 Google folks and its usage is pretty obscure.

 Oh wait - you're _at_ Google, aren't you? Perhaps you're actually
 using clear_refs.

No. I'm just using Google Mail.

And I'm not using any additional patches apart from what is in the mm-patchset.

 Well I can see no reason why the vma we just got to by the mm-mmap
 would have a vm_mm != mm, but I've certainly been wrong before.

 Try changing it to:

 for (vma = mm-mmap; vma; vma = vma-vm_next)
 if (!is_vm_hugetlb_page(vma)) {
 if (vma-vm_mm != mm)
 printk(WTF: vma-vm_mm %p mm %p\n,
 vma-vm_mm, mm);
 walk_page_range(vma-vm_mm, vma-vm_start, 
 vma-vm_end,
 clear_refs_walk, vma);
 }

Will try this, but as the error only seems to trigger, if I leave the
pc powered off for more then a hour, you will need to wait until
tomorrow.

Of note might be, that at the time of this error init has not been
started. I'm using a program from initramfs to start the RAID.
The initramfs was primarily build using the genkernel package from
gentoo, but I have removed some parts of it (kernel modules and some
other part I no longer remember)

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On Wed, Oct 03, 2007 at 07:36:55PM +0200, Torsten Kaiser wrote:
 On 10/3/07, Matt Mackall [EMAIL PROTECTED] wrote:
  On Wed, Oct 03, 2007 at 05:55:10PM +0200, Torsten Kaiser wrote:
   This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
   its code to a new function. But during the move the main for-loop from
   clear_refs_smap was changed:
  
   old:
 for (vma = mm-mmap; vma; vma = vma-vm_next)
 if (vma-vm_mm  !is_vm_hugetlb_page(vma))
 walk_page_range(vma-vm_mm, vma-vm_start, 
   vma-vm_end,
 clear_refs_walk, vma);
  
   new:
 for (vma = mm-mmap; vma; vma = vma-vm_next)
 if (!is_vm_hugetlb_page(vma))
 walk_page_range(mm, vma-vm_start, vma-vm_end,
 clear_refs_walk, vma);
  
   The walk_page_range() is no longer called on vma-vm_mm, but on mm 
   directly.
   I don't know how this can kill the sata_sil24-driver, but at least it
   looks suspicious.
 
  That code should be fine. Further, it's pretty unlikely that this code
  ever gets invoked. This whole interface was only recently added by
  Google folks and its usage is pretty obscure.
 
  Oh wait - you're _at_ Google, aren't you? Perhaps you're actually
  using clear_refs.
 
 No. I'm just using Google Mail.

Ahh, I see. Wasn't aware of the trademark dispute..

 Of note might be, that at the time of this error init has not been
 started. I'm using a program from initramfs to start the RAID.
 The initramfs was primarily build using the genkernel package from
 gentoo, but I have removed some parts of it (kernel modules and some
 other part I no longer remember)

Then I really don't see how these patches can have anything to do with
your problem. They only affect some currently obscure files in /proc
and it's unlikely to have any impact at init time. It's more likely
that there's some code alignment Heisenbug or you got a false negative
somewhere in your bisection.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/3/07, Matt Mackall [EMAIL PROTECTED] wrote:
 On Wed, Oct 03, 2007 at 07:36:55PM +0200, Torsten Kaiser wrote:
  Of note might be, that at the time of this error init has not been
  started. I'm using a program from initramfs to start the RAID.
  The initramfs was primarily build using the genkernel package from
  gentoo, but I have removed some parts of it (kernel modules and some
  other part I no longer remember)

 Then I really don't see how these patches can have anything to do with
 your problem. They only affect some currently obscure files in /proc
 and it's unlikely to have any impact at init time. It's more likely
 that there's some code alignment Heisenbug or you got a false negative
 somewhere in your bisection.

Heisenbug via code alignment might be, but the bisection result seemed
rather reliable.
I have 4 good boots with all mm-patches before the maps2-* ones and 2
out of 3 boots failed with part or all maps2-* patches applied. (Any
many more failed boots with more mm-patches applied.)

But I do think it might be very possible that some (maybe even
correct) change in the behavior of this code triggers the error. So
I'm very interested in what you additional printk outputs to track
down what changed and why this breaks.

I looked at the scripts from the initramfs, but sadly I don't see
anything that crawls throug /proc and might hit your files. Other than
that there is nearly only busybox in there. cryptsetup should run
later and modprobe does not have any modules it could try to load...

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/3/07, Matt Mackall [EMAIL PROTECTED] wrote:
 Well I can see no reason why the vma we just got to by the mm-mmap
 would have a vm_mm != mm, but I've certainly been wrong before.

 Try changing it to:

 for (vma = mm-mmap; vma; vma = vma-vm_next)
 if (!is_vm_hugetlb_page(vma)) {
 if (vma-vm_mm != mm)
 printk(WTF: vma-vm_mm %p mm %p\n,
 vma-vm_mm, mm);
 walk_page_range(vma-vm_mm, vma-vm_start, 
 vma-vm_end,
 clear_refs_walk, vma);
 }

You were right.
I was able to trigger the error with above printk added, but nothing
was written to the syslog.

So now I'm rather out of ideas what to test... :(

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-10-01 Thread Torsten Kaiser

On 9/30/07, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> On 9/30/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> > Torsten Kaiser wrote:
> > > What I find kind of interessing is, that while I got three different
> > > error codes the cmd part of the output was always the same.
> >
> > That's NCQ write command.  You'll be using it a lot if you're rebuilding
> > md5.
>
> It's not rebuilding the RAID at that point.
> If one drive fails, I reboot into a "safe" kernel, fix the RAID and
> that way try the next boot with a clean RAID again.
>
> The error happens when the RAID is initialized, might be the first
> write into the superblock to mark it dirty/inuse that triggers the
> error.

To make that comment "cmd part of the output was always the same" more
clear: I did not only mean that the first (cmd-)byte was the same, but
the whole line:
cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out

I did find ATA_CMD_FPDMA_WRITE = 0x61in ata.h, but I don't find a good
hint for the 08 that follows. The interpretation XFER_PIO_0 = 0x08
seems wrong...

> > It seems like something is going wrong with request DMA or sg
> > mapping.  Maybe some change in block/*.[hc]?
>
> The sg-chaining-patch stands out, but I have no conclusive proof that
> it really is the cause.
> As noted in this thread, a long time I thought that rc7 with the
> sg-chaining-patch was safe, but one time it also showed the error.

If I look at what patches remain, it seems that some other (earlier)
patch that is new in 2.6.23-rc4-mm1 is the trigger, but it will only
fail together with a second patch.

> > > [the 2.6.23-rc4-mm1 series-file has 2013 lines]
> > > Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
> > > Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
> > > (line779): 2 good boots

Up to (incl.) fs-remove-some-aop_truncated_page.patch (line 956): 2 good boots
Up to (incl.) update-n_high_memory-node-state-for-memory-hotadd.patch (line978)
1 good boot, so I would tend to say that this patches are also OK, but
I will recheck this version.

Up to (incl.) 
maps2-make-proc-pid-smaps-optional-under-config_embeddedpatch-fix.patch
(line 1038): 1 failed

> > > Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
> > > 1045): 1 failed
> > > Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 
> > > failed

Updated list of remaining, new patches in rc4:

> > > That means from the patches added to the rc4 variant of the mm-kernel
> > > the following are remaining:
> [snip]
> > > memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
> > > memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
> > > memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
> > > update-n_high_memory-node-state-for-memory-hotadd.patch

And I think these are OK too.
So I think some earlier patch triggers bug together with one of these patches:

categorize-gfp-flags.patch
categorize-gfp-flags-fix.patch
make-swappiness-safer-to-use.patch
... these should not change any behaivor from there description

> flush-cache-before-*
> Don't think so: No ia64 system, unchanged from rc3

> # grouping pages by mobility patches
> ... no idee, but seem unchanged

> maps2.*
> Don't think that related...

Will continue to break this down...

> As for you printk:
> From two goot boots, I had not had any failures with it:
> First one:
> Sep 30 19:24:53 treogen [3.81] XXX sil24 cb=810037ef
> cb_dma=37ef
> Sep 30 19:24:53 treogen [3.82] XXX sil24 cb=810037f0
> cb_dma=37f0
> Second:
> Sep 30 20:06:22 treogen [3.82] XXX sil24 cb=810037f0
> cb_dma=37f0
> Sep 30 20:06:22 treogen [3.83] XXX sil24 cb=810037f1
> cb_dma=37f1

Next boot with this kernel was also good:
Oct  1 06:52:32 treogen [3.74] XXX sil24 cb=810037f0
cb_dma=37f0
Oct  1 06:52:32 treogen [3.75] XXX sil24 cb=810037f1
cb_dma=37f1

Next kernel 
(maps2-make-proc-pid-smaps-optional-under-config_embeddedpatch-fix.patch)
Oct  1 18:07:09 treogen [3.76] XXX sil24 cb=810005df
cb_dma=5df
Oct  1 18:07:09 treogen [3.77] XXX sil24 cb=810005e0
cb_dma=5e0
-> failed, but when rebooting instead of a cold boot that kernel worked:
Oct  1 18:14:42 treogen [3.80] XXX sil24 cb=810005df
cb_dma=5df
Oct  1 18:14:42 treogen [3.81] XXX sil24 cb=810005e0
cb_dma=5e0
... same values, so thats not an indicator that trouble will follow.

Next kernel (update-n_high_memory-node-state-for-memory-hotadd.patch)
Oct  1 19:31:38 treogen [3.84] XXX sil24 cb=81007fa6
cb_dma=7fa6
Oct  1 19:31:38 treogen [3.85] XXX sil24 cb=810037f0
cb_dma=37f0
that was also a good boot.

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-10-01 Thread Torsten Kaiser

On 9/30/07, Torsten Kaiser [EMAIL PROTECTED] wrote:
 On 9/30/07, Tejun Heo [EMAIL PROTECTED] wrote:
  Torsten Kaiser wrote:
   What I find kind of interessing is, that while I got three different
   error codes the cmd part of the output was always the same.
 
  That's NCQ write command.  You'll be using it a lot if you're rebuilding
  md5.

 It's not rebuilding the RAID at that point.
 If one drive fails, I reboot into a safe kernel, fix the RAID and
 that way try the next boot with a clean RAID again.

 The error happens when the RAID is initialized, might be the first
 write into the superblock to mark it dirty/inuse that triggers the
 error.

To make that comment cmd part of the output was always the same more
clear: I did not only mean that the first (cmd-)byte was the same, but
the whole line:
cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out

I did find ATA_CMD_FPDMA_WRITE = 0x61in ata.h, but I don't find a good
hint for the 08 that follows. The interpretation XFER_PIO_0 = 0x08
seems wrong...

  It seems like something is going wrong with request DMA or sg
  mapping.  Maybe some change in block/*.[hc]?

 The sg-chaining-patch stands out, but I have no conclusive proof that
 it really is the cause.
 As noted in this thread, a long time I thought that rc7 with the
 sg-chaining-patch was safe, but one time it also showed the error.

If I look at what patches remain, it seems that some other (earlier)
patch that is new in 2.6.23-rc4-mm1 is the trigger, but it will only
fail together with a second patch.

   [the 2.6.23-rc4-mm1 series-file has 2013 lines]
   Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
   Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
   (line779): 2 good boots

Up to (incl.) fs-remove-some-aop_truncated_page.patch (line 956): 2 good boots
Up to (incl.) update-n_high_memory-node-state-for-memory-hotadd.patch (line978)
1 good boot, so I would tend to say that this patches are also OK, but
I will recheck this version.

Up to (incl.) 
maps2-make-proc-pid-smaps-optional-under-config_embeddedpatch-fix.patch
(line 1038): 1 failed

   Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
   1045): 1 failed
   Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 
   failed

Updated list of remaining, new patches in rc4:

   That means from the patches added to the rc4 variant of the mm-kernel
   the following are remaining:
 [snip]
   memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
   memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
   memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
   update-n_high_memory-node-state-for-memory-hotadd.patch

And I think these are OK too.
So I think some earlier patch triggers bug together with one of these patches:

categorize-gfp-flags.patch
categorize-gfp-flags-fix.patch
make-swappiness-safer-to-use.patch
... these should not change any behaivor from there description

 flush-cache-before-*
 Don't think so: No ia64 system, unchanged from rc3

 # grouping pages by mobility patches
 ... no idee, but seem unchanged

 maps2.*
 Don't think that related...

Will continue to break this down...

 As for you printk:
 From two goot boots, I had not had any failures with it:
 First one:
 Sep 30 19:24:53 treogen [3.81] XXX sil24 cb=810037ef
 cb_dma=37ef
 Sep 30 19:24:53 treogen [3.82] XXX sil24 cb=810037f0
 cb_dma=37f0
 Second:
 Sep 30 20:06:22 treogen [3.82] XXX sil24 cb=810037f0
 cb_dma=37f0
 Sep 30 20:06:22 treogen [3.83] XXX sil24 cb=810037f1
 cb_dma=37f1

Next boot with this kernel was also good:
Oct  1 06:52:32 treogen [3.74] XXX sil24 cb=810037f0
cb_dma=37f0
Oct  1 06:52:32 treogen [3.75] XXX sil24 cb=810037f1
cb_dma=37f1

Next kernel 
(maps2-make-proc-pid-smaps-optional-under-config_embeddedpatch-fix.patch)
Oct  1 18:07:09 treogen [3.76] XXX sil24 cb=810005df
cb_dma=5df
Oct  1 18:07:09 treogen [3.77] XXX sil24 cb=810005e0
cb_dma=5e0
- failed, but when rebooting instead of a cold boot that kernel worked:
Oct  1 18:14:42 treogen [3.80] XXX sil24 cb=810005df
cb_dma=5df
Oct  1 18:14:42 treogen [3.81] XXX sil24 cb=810005e0
cb_dma=5e0
... same values, so thats not an indicator that trouble will follow.

Next kernel (update-n_high_memory-node-state-for-memory-hotadd.patch)
Oct  1 19:31:38 treogen [3.84] XXX sil24 cb=81007fa6
cb_dma=7fa6
Oct  1 19:31:38 treogen [3.85] XXX sil24 cb=810037f0
cb_dma=37f0
that was also a good boot.

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/30/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Torsten Kaiser wrote:
> > What I find kind of interessing is, that while I got three different
> > error codes the cmd part of the output was always the same.
>
> That's NCQ write command.  You'll be using it a lot if you're rebuilding
> md5.

It's not rebuilding the RAID at that point.
If one drive fails, I reboot into a "safe" kernel, fix the RAID and
that way try the next boot with a clean RAID again.

The error happens when the RAID is initialized, might be the first
write into the superblock to mark it dirty/inuse that triggers the
error.

> It seems like something is going wrong with request DMA or sg
> mapping.  Maybe some change in block/*.[hc]?

The sg-chaining-patch stands out, but I have no conclusive proof that
it really is the cause.
As noted in this thread, a long time I thought that rc7 with the
sg-chaining-patch was safe, but one time it also showed the error.

> > It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
> > Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
> > narrowing on rc4-mm1 because that was the first version to break.
> >
> > I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:
>
> Have you tested 2.6.23-rc4 without mm patches?  It could be something
> introduced between -rc3 and 4.

Not directly, but I have 4 good boots with one part of the mm-patches.
So I would tend to say that mainline 2.6.23-rc4 does not have this
bug.

> > [the 2.6.23-rc4-mm1 series-file has 2013 lines]
> > Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
> > Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
> > (line779): 2 good boots
> > Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
> > 1045): 1 failed
> > Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 
> > failed
> >
> > Next try: up to patch fs-remove-some-aop_truncated_page.patch

Looks more like this is OK too.

> > That means from the patches added to the rc4 variant of the mm-kernel
> > the following are remaining:
[snip]
> > memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
> > memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
> > memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
> > update-n_high_memory-node-state-for-memory-hotadd.patch
> > slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
> > slub-do-not-use-page-mapping.patch
> > slub-move-page-offset-to-kmem_cache_cpu-offset.patch
> > slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
> > slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
> > slub-optimize-cacheline-use-for-zeroing.patch
> >
> > But due to the unreliable nature of the bug, I can't be to sure about that.
>
> Yeah, that's what I'm worried about.  Bisection is extremely difficult
> if errors are intermittent and takes long time to reproduce.

Yes...

As for the remaining patches:
memoryless-nodes-*
Don't think so: I do have a NUMA system, but both nodes have memory.
flush-cache-before-*
Don't think so: No ia64 system, unchanged from rc3
# grouping pages by mobility patches
... no idee, but seem unchanged
maps2.*
Don't think that related...
remaining slub-* patches
Might be...

As for you printk:
>From two goot boots, I had not had any failures with it:
First one:
Sep 30 19:24:53 treogen [3.81] XXX sil24 cb=810037ef
cb_dma=37ef
Sep 30 19:24:53 treogen [3.82] XXX sil24 cb=810037f0
cb_dma=37f0
Second:
Sep 30 20:06:22 treogen [3.82] XXX sil24 cb=810037f0
cb_dma=37f0
Sep 30 20:06:22 treogen [3.83] XXX sil24 cb=810037f1
cb_dma=37f1

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

Torsten Kaiser wrote:
> That boot ended in a minimal initrd environment that normally only
> starts the RAID5 and then opens contained encrypted real root.
> I was just able to push the output from dmesg through the serial link,
> but had no man pages to tell me about -s ...
> And that kind of error was until now a one-of-a-kind one. All other
> errors where not "internal error" but "timeout".
> But one time I had another SGT related error:
> Sep 11 19:19:24 treogen [   33.34] ata1.00: exception Emask 0x20
> SAct 0x1 SErr 0x0 action 0x2
> Sep 11 19:19:24 treogen [   33.34] ata1.00: irq_stat 0x00020002,
> PCI master abort while fetching SGT
> Sep 11 19:19:24 treogen [   33.34] ata1.00: cmd
> 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
> Sep 11 19:19:24 treogen [   33.34]  res
> 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask 0x20 (host bus error)
> 
> What I find kind of interessing is, that while I got three different
> error codes the cmd part of the output was always the same.

That's NCQ write command.  You'll be using it a lot if you're rebuilding
md5.  It seems like something is going wrong with request DMA or sg
mapping.  Maybe some change in block/*.[hc]?

> It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
> Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
> narrowing on rc4-mm1 because that was the first version to break.
> 
> I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:

Have you tested 2.6.23-rc4 without mm patches?  It could be something
introduced between -rc3 and 4.

> [the 2.6.23-rc4-mm1 series-file has 2013 lines]
> Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
> Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
> (line779): 2 good boots
> Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
> 1045): 1 failed
> Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 
> failed
> 
> Next try: up to patch fs-remove-some-aop_truncated_page.patch
> 
> That means from the patches added to the rc4 variant of the mm-kernel
> the following are remaining:
> git-xfs-build-fix.patch
> enforce-noreplace-smp-in-alternative_instructions.patch
> paravirt-fix-preemptible-lazy-mode-bug.patch
> i386-apic-fix-4-bit-apicid-assumption-of-mach-default.patch
> fix-the-max-path-calculation-in-radix-treec-update.patch
> mm-no-need-to-cast-vmalloc-return-value-in-zone_wait_table_init.patch
> introduce-write_begin-write_end-aops-fix2.patch
> implement-simple-fs-aops-fix.patch
> ext2-convert-to-new-aops-fix2.patch
> ext3-convert-to-new-aops-fix-fix.patch
> ext4-convert-to-new-aops-fix-fix.patch
> gfs2-convert-to-new-aops-fix.patch
> reiserfs-convert-to-new-aops-fix2.patch
> hostfs-convert-to-new-aops-fix-fix.patch
> ufs-convert-to-new-aops-fix2.patch
> sysv-convert-to-new-aops-fix2.patch
> minix-convert-to-new-aops-fix2.patch
> affs-convert-to-new-aops-fix-fix.patch
> memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
> memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
> memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
> update-n_high_memory-node-state-for-memory-hotadd.patch
> slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
> slub-do-not-use-page-mapping.patch
> slub-move-page-offset-to-kmem_cache_cpu-offset.patch
> slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
> slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
> slub-optimize-cacheline-use-for-zeroing.patch
> 
> But due to the unreliable nature of the bug, I can't be to sure about that.

Yeah, that's what I'm worried about.  Bisection is extremely difficult
if errors are intermittent and takes long time to reproduce.

> Next version is compiled, now again switching the PC off for an hour...

Thanks a lot.  Much appreciated.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/30/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Hello, Torsten.
>
> Torsten Kaiser wrote:
> > On 9/28/07, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> >> So in case of -rc3-mm1 I'm pretty sure that it works.
> >
> > That's still the case.
>
> Ah... that's weird.  It would be much better if -rc3-mm1 is broken too.  :-P

It would be much better if it would break all the time. Then bisecting
would not be guesswork. :-P

> >> Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
> >> that 9 times, but from a quick look in /var/log/messages, I might not
> >> have hit the "correct" situation to trigger the error.
> >> That kernel is vanilla 2.6.23-rc7 plus the patch from
> >> http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
> >> ( http://marc.info/?l=linux-ide=119055574826083=2 )
> >
> > That is no longer the case. Yesterday this kernel did also show the failure.
> > The error looked a little bit different, but happend at the same
> > location during the bootup.
> > Sadly dmesg overflowed and I was not able to capture the first error.
> >
> > [   53.462632] Freeing unused kernel memory: 332k freed
> > ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
> > [   77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
> > [   77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
> > [   77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
> > cdb 0x0 data 4096 out
> > [   77.170909]  res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
> > 0x40 (internal error)
>
> Hmm... SGT not on qword boundary?  Please add the following to the end
> of sil24_port_start() and report successful and failed kernel boot log.
>
> printk("XXX sil24 cb=%p cb_dma=%llx\n",
> cb, (unsigned long long)cb_dma);

Added to my current "2.6.23-rc4-mm0.5"
I will report it's output.

> Also, does 'dmesg -s 1000' give full dmesg?

That boot ended in a minimal initrd environment that normally only
starts the RAID5 and then opens contained encrypted real root.
I was just able to push the output from dmesg through the serial link,
but had no man pages to tell me about -s ...
And that kind of error was until now a one-of-a-kind one. All other
errors where not "internal error" but "timeout".
But one time I had another SGT related error:
Sep 11 19:19:24 treogen [   33.34] ata1.00: exception Emask 0x20
SAct 0x1 SErr 0x0 action 0x2
Sep 11 19:19:24 treogen [   33.34] ata1.00: irq_stat 0x00020002,
PCI master abort while fetching SGT
Sep 11 19:19:24 treogen [   33.34] ata1.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 11 19:19:24 treogen [   33.34]  res
50/00:00:af:ea:42/00:00:25:00:00/e0 Emask 0x20 (host bus error)

What I find kind of interessing is, that while I got three different
error codes the cmd part of the output was always the same.

> > Is it possible that the bug is much older, and only some change in
> > rc3-mm->rc4-mm makes it visible that the initialization of the SiI3132
> > is incomplete?
>
> I can't tell but there is a pretty large userbase of sil24/32 and you
> seem to be the only one to report this problem yet.  I think it might be
> coming somewhere else than libata or sata_sil24 itself.  Hmmm... It
> would be really great if you can break -rc3-mm1 too.  Correct behavior
> for something like that for just one -mm version sounds very weird.

It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
narrowing on rc4-mm1 because that was the first version to break.

I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:

[the 2.6.23-rc4-mm1 series-file has 2013 lines]
Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
(line779): 2 good boots
Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
1045): 1 failed
Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed

Next try: up to patch fs-remove-some-aop_truncated_page.patch

That means from the patches added to the rc4 variant of the mm-kernel
the following are remaining:
git-xfs-build-fix.patch
enforce-noreplace-smp-in-alternative_instructions.patch
paravirt-fix-preemptible-lazy-mode-bug.patch
i386-apic-fix-4-bit-apicid-assumption-of-mach-default.patch
fix-the-max-path-calculation-in-radix-treec-update.patch
mm-no-need-to-cast-vmalloc-return-value-in-zone_wait_table_init.patch
introduce-write_begin-write_end-aops-fix2.patch
implement-simple-fs-aops-fix.patch
ext2-convert-to-new-aops-fix2.patch
ext3-convert-to-new-aops-fix-fix.patch
ext4-convert-to-new-aops-fix-fix.patch
gfs2-convert-to-new-aops-fix.patch
reiserfs-convert-to-new-aops-fix2.patch
hostfs-convert-to-new-aops-fix-fix.patch
ufs-convert-to-new-aops-fix2.patch
sysv-convert-to-new-aops-fix2.patch

Re: sata_sil24 broken since 2.6.23-rc4-mm1

Hello, Torsten.

Torsten Kaiser wrote:
> On 9/28/07, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
>> So in case of -rc3-mm1 I'm pretty sure that it works.
> 
> That's still the case.

Ah... that's weird.  It would be much better if -rc3-mm1 is broken too.  :-P

>> Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
>> that 9 times, but from a quick look in /var/log/messages, I might not
>> have hit the "correct" situation to trigger the error.
>> That kernel is vanilla 2.6.23-rc7 plus the patch from
>> http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
>> ( http://marc.info/?l=linux-ide=119055574826083=2 )
> 
> That is no longer the case. Yesterday this kernel did also show the failure.
> The error looked a little bit different, but happend at the same
> location during the bootup.
> Sadly dmesg overflowed and I was not able to capture the first error.
> 
> [   53.462632] Freeing unused kernel memory: 332k freed
> ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
> [   77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
> [   77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
> [   77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
> cdb 0x0 data 4096 out
> [   77.170909]  res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
> 0x40 (internal error)

Hmm... SGT not on qword boundary?  Please add the following to the end
of sil24_port_start() and report successful and failed kernel boot log.

printk("XXX sil24 cb=%p cb_dma=%llx\n",
cb, (unsigned long long)cb_dma);

Also, does 'dmesg -s 1000' give full dmesg?

> I rebooted into a system (kernel 2.6.21-rc5-mm2, please not the
> 2.6.*21*, that is only a rescue system and not updated often) on a
> separate partition to rebuild it. When I tried to readd the failed sdb
> the system locked up with this error:
> 
> Sep 29 22:25:37 treogen [  205.407893] ata2.00: exception Emask 0x0
> SAct 0x0 SErr 0x0 action 0x2 frozen
> Sep 29 22:25:37 treogen [  205.407900] ata2.00: cmd
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
> Sep 29 22:25:37 treogen [  205.407901]  res
> 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
> 
> Is it possible that the bug is much older, and only some change in
> rc3-mm->rc4-mm makes it visible that the initialization of the SiI3132
> is incomplete?

I can't tell but there is a pretty large userbase of sil24/32 and you
seem to be the only one to report this problem yet.  I think it might be
coming somewhere else than libata or sata_sil24 itself.  Hmmm... It
would be really great if you can break -rc3-mm1 too.  Correct behavior
for something like that for just one -mm version sounds very weird.

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/28/07, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> So in case of -rc3-mm1 I'm pretty sure that it works.

That's still the case.

> Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
> that 9 times, but from a quick look in /var/log/messages, I might not
> have hit the "correct" situation to trigger the error.
> That kernel is vanilla 2.6.23-rc7 plus the patch from
> http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
> ( http://marc.info/?l=linux-ide=119055574826083=2 )

That is no longer the case. Yesterday this kernel did also show the failure.
The error looked a little bit different, but happend at the same
location during the bootup.
Sadly dmesg overflowed and I was not able to capture the first error.

[   53.462632] Freeing unused kernel memory: 332k freed
ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
[   77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
[   77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[   77.170909]  res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
0x40 (internal error)
[   77.504292] ata1: soft resetting port
[   77.544221] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[   77.642567] ata1.00: configured for UDMA/100
[   77.642572] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE,SUGGEST_OK
[   77.642574] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current]
[descriptor]
[   77.642578] Descriptor sense data with sense descriptors (in hex):
[   77.642580] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[   77.642584] 00 00 00 af
[   77.642586] sd 0:0:0:0: [sda] Add. Sense: No additional sense information
[   77.642589] end_request: I/O error, dev sda, sector 625137161
[   77.642593] ata1: EH complete
[   77.642609] sd 0:0:0:0: [sda] 625142448 512-byte hardware sectors (320073 MB)
[   77.642616] sd 0:0:0:0: [sda] Write Protect is off
[   77.642618] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[   77.642628] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[   77.666927] md: super_written gets error=-5, uptodate=0
[   77.666930] raid5: Disk failure on sda2, disabling device.
Operation continuing on 1 devices

That "Operation continuing on 1 devices" is a 'little' bit misleading.
A RAID5 with two failed devices will not continue to operate. :(
The same error is repeated several times, so I expect the first error
also looked like that.

Other things I have done to narrow it down:
Comparing 2.6.23-rc3-mm1 and 2.6.23-rc4-mm1 I found the following hunk
from libata-core.c:
@@ -2182,6 +2183,17 @@
if (ap->ops->cable_detect)
ap->cbl = ap->ops->cable_detect(ap);

+   /* We may have SATA bridge glue hiding here irrespective of the
+  reported cable types and sensed types */
+   ata_link_for_each_dev(dev, >link) {
+   if (!ata_dev_enabled(dev))
+   continue;
+   /* SATA drives indicate we have a bridge. We don't know which
+  end of the link the bridge is which is a problem */
+   if (ata_id_is_sata(dev->id))
+   ap->cbl = ATA_CBL_SATA;
+   }
+

But removing this from the current -mm does still show the error, so
thats not the source of the bug. (That test kicked the first drive
(sdb) from the RAID5 with the same error as most times)

Rebooting into the "safe" rc7-sglist-kernel I got the above error that
killed the RAID5.
I rebooted into a system (kernel 2.6.21-rc5-mm2, please not the
2.6.*21*, that is only a rescue system and not updated often) on a
separate partition to rebuild it. When I tried to readd the failed sdb
the system locked up with this error:

Sep 29 22:25:37 treogen [  205.407893] ata2.00: exception Emask 0x0
SAct 0x0 SErr 0x0 action 0x2 frozen
Sep 29 22:25:37 treogen [  205.407900] ata2.00: cmd
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
Sep 29 22:25:37 treogen [  205.407901]  res
40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
Sep 29 22:25:38 treogen [  205.720531] ata2: soft resetting port
Sep 29 22:25:38 treogen [  205.720554] ata2: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:25:38 treogen [  205.730560] ATA: abnormal status 0xEA on
port 0xc20001432007
Sep 29 22:25:38 treogen [  205.740560] ATA: abnormal status 0xEA on
port 0xc20001432007
Sep 29 22:26:07 treogen [  235.343001] ata1.00: exception Emask 0x0
SAct 0x1 SErr 0x0 action 0x2 frozen
Sep 29 22:26:07 treogen [  235.343007] ata1.00: cmd
61/08:00:bf:52:2a/00:00:01:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 29 22:26:07 treogen [  235.343009]  res
40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
Sep 29 22:26:08 treogen [  235.655644] ata1: soft resetting port
Sep 29 22:26:08 treogen [  235.655666] ata1: SATA link up 3.0 Gbps

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/28/07, Torsten Kaiser [EMAIL PROTECTED] wrote:
 So in case of -rc3-mm1 I'm pretty sure that it works.

That's still the case.

 Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
 that 9 times, but from a quick look in /var/log/messages, I might not
 have hit the correct situation to trigger the error.
 That kernel is vanilla 2.6.23-rc7 plus the patch from
 http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
 ( http://marc.info/?l=linux-idem=119055574826083w=2 )

That is no longer the case. Yesterday this kernel did also show the failure.
The error looked a little bit different, but happend at the same
location during the bootup.
Sadly dmesg overflowed and I was not able to capture the first error.

[   53.462632] Freeing unused kernel memory: 332k freed
ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
[   77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
[   77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[   77.170909]  res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
0x40 (internal error)
[   77.504292] ata1: soft resetting port
[   77.544221] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[   77.642567] ata1.00: configured for UDMA/100
[   77.642572] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE,SUGGEST_OK
[   77.642574] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current]
[descriptor]
[   77.642578] Descriptor sense data with sense descriptors (in hex):
[   77.642580] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[   77.642584] 00 00 00 af
[   77.642586] sd 0:0:0:0: [sda] Add. Sense: No additional sense information
[   77.642589] end_request: I/O error, dev sda, sector 625137161
[   77.642593] ata1: EH complete
[   77.642609] sd 0:0:0:0: [sda] 625142448 512-byte hardware sectors (320073 MB)
[   77.642616] sd 0:0:0:0: [sda] Write Protect is off
[   77.642618] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[   77.642628] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[   77.666927] md: super_written gets error=-5, uptodate=0
[   77.666930] raid5: Disk failure on sda2, disabling device.
Operation continuing on 1 devices

That Operation continuing on 1 devices is a 'little' bit misleading.
A RAID5 with two failed devices will not continue to operate. :(
The same error is repeated several times, so I expect the first error
also looked like that.

Other things I have done to narrow it down:
Comparing 2.6.23-rc3-mm1 and 2.6.23-rc4-mm1 I found the following hunk
from libata-core.c:
@@ -2182,6 +2183,17 @@
if (ap-ops-cable_detect)
ap-cbl = ap-ops-cable_detect(ap);

+   /* We may have SATA bridge glue hiding here irrespective of the
+  reported cable types and sensed types */
+   ata_link_for_each_dev(dev, ap-link) {
+   if (!ata_dev_enabled(dev))
+   continue;
+   /* SATA drives indicate we have a bridge. We don't know which
+  end of the link the bridge is which is a problem */
+   if (ata_id_is_sata(dev-id))
+   ap-cbl = ATA_CBL_SATA;
+   }
+

But removing this from the current -mm does still show the error, so
thats not the source of the bug. (That test kicked the first drive
(sdb) from the RAID5 with the same error as most times)

Rebooting into the safe rc7-sglist-kernel I got the above error that
killed the RAID5.
I rebooted into a system (kernel 2.6.21-rc5-mm2, please not the
2.6.*21*, that is only a rescue system and not updated often) on a
separate partition to rebuild it. When I tried to readd the failed sdb
the system locked up with this error:

Sep 29 22:25:37 treogen [  205.407893] ata2.00: exception Emask 0x0
SAct 0x0 SErr 0x0 action 0x2 frozen
Sep 29 22:25:37 treogen [  205.407900] ata2.00: cmd
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
Sep 29 22:25:37 treogen [  205.407901]  res
40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
Sep 29 22:25:38 treogen [  205.720531] ata2: soft resetting port
Sep 29 22:25:38 treogen [  205.720554] ata2: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:25:38 treogen [  205.730560] ATA: abnormal status 0xEA on
port 0xc20001432007
Sep 29 22:25:38 treogen [  205.740560] ATA: abnormal status 0xEA on
port 0xc20001432007
Sep 29 22:26:07 treogen [  235.343001] ata1.00: exception Emask 0x0
SAct 0x1 SErr 0x0 action 0x2 frozen
Sep 29 22:26:07 treogen [  235.343007] ata1.00: cmd
61/08:00:bf:52:2a/00:00:01:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 29 22:26:07 treogen [  235.343009]  res
40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
Sep 29 22:26:08 treogen [  235.655644] ata1: soft resetting port
Sep 29 22:26:08 treogen [  235.655666] ata1: SATA link up 3.0 Gbps
(SStatus 123 SControl

Re: sata_sil24 broken since 2.6.23-rc4-mm1

Hello, Torsten.

Torsten Kaiser wrote:
 On 9/28/07, Torsten Kaiser [EMAIL PROTECTED] wrote:
 So in case of -rc3-mm1 I'm pretty sure that it works.
 
 That's still the case.

Ah... that's weird.  It would be much better if -rc3-mm1 is broken too.  :-P

 Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
 that 9 times, but from a quick look in /var/log/messages, I might not
 have hit the correct situation to trigger the error.
 That kernel is vanilla 2.6.23-rc7 plus the patch from
 http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
 ( http://marc.info/?l=linux-idem=119055574826083w=2 )
 
 That is no longer the case. Yesterday this kernel did also show the failure.
 The error looked a little bit different, but happend at the same
 location during the bootup.
 Sadly dmesg overflowed and I was not able to capture the first error.
 
 [   53.462632] Freeing unused kernel memory: 332k freed
 ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
 [   77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
 [   77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
 [   77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
 cdb 0x0 data 4096 out
 [   77.170909]  res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
 0x40 (internal error)

Hmm... SGT not on qword boundary?  Please add the following to the end
of sil24_port_start() and report successful and failed kernel boot log.

printk(XXX sil24 cb=%p cb_dma=%llx\n,
cb, (unsigned long long)cb_dma);

Also, does 'dmesg -s 1000' give full dmesg?

 I rebooted into a system (kernel 2.6.21-rc5-mm2, please not the
 2.6.*21*, that is only a rescue system and not updated often) on a
 separate partition to rebuild it. When I tried to readd the failed sdb
 the system locked up with this error:
 
 Sep 29 22:25:37 treogen [  205.407893] ata2.00: exception Emask 0x0
 SAct 0x0 SErr 0x0 action 0x2 frozen
 Sep 29 22:25:37 treogen [  205.407900] ata2.00: cmd
 ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
 Sep 29 22:25:37 treogen [  205.407901]  res
 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
 
 Is it possible that the bug is much older, and only some change in
 rc3-mm-rc4-mm makes it visible that the initialization of the SiI3132
 is incomplete?

I can't tell but there is a pretty large userbase of sil24/32 and you
seem to be the only one to report this problem yet.  I think it might be
coming somewhere else than libata or sata_sil24 itself.  Hmmm... It
would be really great if you can break -rc3-mm1 too.  Correct behavior
for something like that for just one -mm version sounds very weird.

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/30/07, Tejun Heo [EMAIL PROTECTED] wrote:
Hello, Torsten.

Torsten Kaiser wrote:
On 9/28/07, Torsten Kaiser [EMAIL PROTECTED] wrote:
So in case of -rc3-mm1 I'm pretty sure that it works.

That's still the case.

Ah... that's weird. It would be much better if -rc3-mm1 is broken too. :-P

It would be much better if it would break all the time. Then bisecting
would not be guesswork. :-P

Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
that 9 times, but from a quick look in /var/log/messages, I might not
have hit the correct situation to trigger the error.
That kernel is vanilla 2.6.23-rc7 plus the patch from
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
( http://marc.info/?l=linux-idem=119055574826083w=2 )

That is no longer the case. Yesterday this kernel did also show the failure.
The error looked a little bit different, but happend at the same
location during the bootup.
Sadly dmesg overflowed and I was not able to capture the first error.

[ 53.462632] Freeing unused kernel memory: 332k freed
ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
[ 77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
[ 77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[ 77.170909] res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
0x40 (internal error)

Hmm... SGT not on qword boundary? Please add the following to the end
of sil24_port_start() and report successful and failed kernel boot log.

printk(XXX sil24 cb=%p cb_dma=%llx\n,
cb, (unsigned long long)cb_dma);

Added to my current 2.6.23-rc4-mm0.5
I will report it's output.

Also, does 'dmesg -s 1000' give full dmesg?

That boot ended in a minimal initrd environment that normally only
starts the RAID5 and then opens contained encrypted real root.
I was just able to push the output from dmesg through the serial link,
but had no man pages to tell me about -s ...
And that kind of error was until now a one-of-a-kind one. All other
errors where not internal error but timeout.
But one time I had another SGT related error:
Sep 11 19:19:24 treogen [ 33.34] ata1.00: exception Emask 0x20
SAct 0x1 SErr 0x0 action 0x2
Sep 11 19:19:24 treogen [ 33.34] ata1.00: irq_stat 0x00020002,
PCI master abort while fetching SGT
Sep 11 19:19:24 treogen [ 33.34] ata1.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 11 19:19:24 treogen [ 33.34] res
50/00:00:af:ea:42/00:00:25:00:00/e0 Emask 0x20 (host bus error)

What I find kind of interessing is, that while I got three different
error codes the cmd part of the output was always the same.

Is it possible that the bug is much older, and only some change in
rc3-mm-rc4-mm makes it visible that the initialization of the SiI3132
is incomplete?

I can't tell but there is a pretty large userbase of sil24/32 and you
seem to be the only one to report this problem yet. I think it might be
coming somewhere else than libata or sata_sil24 itself. Hmmm... It
would be really great if you can break -rc3-mm1 too. Correct behavior
for something like that for just one -mm version sounds very weird.

It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
narrowing on rc4-mm1 because that was the first version to break.

I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:

[the 2.6.23-rc4-mm1 series-file has 2013 lines]
Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
(line779): 2 good boots
Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
1045): 1 failed
Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed

Next try: up to patch fs-remove-some-aop_truncated_page.patch

That means from the patches added to the rc4 variant of the mm-kernel
the following are remaining:
git-xfs-build-fix.patch
enforce-noreplace-smp-in-alternative_instructions.patch
paravirt-fix-preemptible-lazy-mode-bug.patch
i386-apic-fix-4-bit-apicid-assumption-of-mach-default.patch
fix-the-max-path-calculation-in-radix-treec-update.patch
mm-no-need-to-cast-vmalloc-return-value-in-zone_wait_table_init.patch
introduce-write_begin-write_end-aops-fix2.patch
implement-simple-fs-aops-fix.patch
ext2-convert-to-new-aops-fix2.patch
ext3-convert-to-new-aops-fix-fix.patch
ext4-convert-to-new-aops-fix-fix.patch
gfs2-convert-to-new-aops-fix.patch
reiserfs-convert-to-new-aops-fix2.patch
hostfs-convert-to-new-aops-fix-fix.patch
ufs-convert-to-new-aops-fix2.patch
sysv-convert-to-new-aops-fix2.patch
minix-convert-to-new-aops-fix2.patch
affs-convert-to-new-aops-fix-fix.patch

Re: sata_sil24 broken since 2.6.23-rc4-mm1

Torsten Kaiser wrote:
 That boot ended in a minimal initrd environment that normally only
 starts the RAID5 and then opens contained encrypted real root.
 I was just able to push the output from dmesg through the serial link,
 but had no man pages to tell me about -s ...
 And that kind of error was until now a one-of-a-kind one. All other
 errors where not internal error but timeout.
 But one time I had another SGT related error:
 Sep 11 19:19:24 treogen [   33.34] ata1.00: exception Emask 0x20
 SAct 0x1 SErr 0x0 action 0x2
 Sep 11 19:19:24 treogen [   33.34] ata1.00: irq_stat 0x00020002,
 PCI master abort while fetching SGT
 Sep 11 19:19:24 treogen [   33.34] ata1.00: cmd
 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
 Sep 11 19:19:24 treogen [   33.34]  res
 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask 0x20 (host bus error)
 
 What I find kind of interessing is, that while I got three different
 error codes the cmd part of the output was always the same.

That's NCQ write command.  You'll be using it a lot if you're rebuilding
md5.  It seems like something is going wrong with request DMA or sg
mapping.  Maybe some change in block/*.[hc]?

 It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
 Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
 narrowing on rc4-mm1 because that was the first version to break.
 
 I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:

Have you tested 2.6.23-rc4 without mm patches?  It could be something
introduced between -rc3 and 4.

 [the 2.6.23-rc4-mm1 series-file has 2013 lines]
 Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
 Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
 (line779): 2 good boots
 Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
 1045): 1 failed
 Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 
 failed
 
 Next try: up to patch fs-remove-some-aop_truncated_page.patch
 
 That means from the patches added to the rc4 variant of the mm-kernel
 the following are remaining:
 git-xfs-build-fix.patch
 enforce-noreplace-smp-in-alternative_instructions.patch
 paravirt-fix-preemptible-lazy-mode-bug.patch
 i386-apic-fix-4-bit-apicid-assumption-of-mach-default.patch
 fix-the-max-path-calculation-in-radix-treec-update.patch
 mm-no-need-to-cast-vmalloc-return-value-in-zone_wait_table_init.patch
 introduce-write_begin-write_end-aops-fix2.patch
 implement-simple-fs-aops-fix.patch
 ext2-convert-to-new-aops-fix2.patch
 ext3-convert-to-new-aops-fix-fix.patch
 ext4-convert-to-new-aops-fix-fix.patch
 gfs2-convert-to-new-aops-fix.patch
 reiserfs-convert-to-new-aops-fix2.patch
 hostfs-convert-to-new-aops-fix-fix.patch
 ufs-convert-to-new-aops-fix2.patch
 sysv-convert-to-new-aops-fix2.patch
 minix-convert-to-new-aops-fix2.patch
 affs-convert-to-new-aops-fix-fix.patch
 memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
 memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
 memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
 update-n_high_memory-node-state-for-memory-hotadd.patch
 slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
 slub-do-not-use-page-mapping.patch
 slub-move-page-offset-to-kmem_cache_cpu-offset.patch
 slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
 slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
 slub-optimize-cacheline-use-for-zeroing.patch
 
 But due to the unreliable nature of the bug, I can't be to sure about that.

Yeah, that's what I'm worried about.  Bisection is extremely difficult
if errors are intermittent and takes long time to reproduce.

 Next version is compiled, now again switching the PC off for an hour...

Thanks a lot.  Much appreciated.

-- 
tejun
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/30/07, Tejun Heo [EMAIL PROTECTED] wrote:
 Torsten Kaiser wrote:
  What I find kind of interessing is, that while I got three different
  error codes the cmd part of the output was always the same.

 That's NCQ write command.  You'll be using it a lot if you're rebuilding
 md5.

It's not rebuilding the RAID at that point.
If one drive fails, I reboot into a safe kernel, fix the RAID and
that way try the next boot with a clean RAID again.

The error happens when the RAID is initialized, might be the first
write into the superblock to mark it dirty/inuse that triggers the
error.

 It seems like something is going wrong with request DMA or sg
 mapping.  Maybe some change in block/*.[hc]?

The sg-chaining-patch stands out, but I have no conclusive proof that
it really is the cause.
As noted in this thread, a long time I thought that rc7 with the
sg-chaining-patch was safe, but one time it also showed the error.

  It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
  Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
  narrowing on rc4-mm1 because that was the first version to break.
 
  I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:

 Have you tested 2.6.23-rc4 without mm patches?  It could be something
 introduced between -rc3 and 4.

Not directly, but I have 4 good boots with one part of the mm-patches.
So I would tend to say that mainline 2.6.23-rc4 does not have this
bug.

  [the 2.6.23-rc4-mm1 series-file has 2013 lines]
  Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
  Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
  (line779): 2 good boots
  Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
  1045): 1 failed
  Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 
  failed
 
  Next try: up to patch fs-remove-some-aop_truncated_page.patch

Looks more like this is OK too.

  That means from the patches added to the rc4 variant of the mm-kernel
  the following are remaining:
[snip]
  memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
  memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
  memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
  update-n_high_memory-node-state-for-memory-hotadd.patch
  slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
  slub-do-not-use-page-mapping.patch
  slub-move-page-offset-to-kmem_cache_cpu-offset.patch
  slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
  slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
  slub-optimize-cacheline-use-for-zeroing.patch
 
  But due to the unreliable nature of the bug, I can't be to sure about that.

 Yeah, that's what I'm worried about.  Bisection is extremely difficult
 if errors are intermittent and takes long time to reproduce.

Yes...

As for the remaining patches:
memoryless-nodes-*
Don't think so: I do have a NUMA system, but both nodes have memory.
flush-cache-before-*
Don't think so: No ia64 system, unchanged from rc3
# grouping pages by mobility patches
... no idee, but seem unchanged
maps2.*
Don't think that related...
remaining slub-* patches
Might be...

As for you printk:
From two goot boots, I had not had any failures with it:
First one:
Sep 30 19:24:53 treogen [3.81] XXX sil24 cb=810037ef
cb_dma=37ef
Sep 30 19:24:53 treogen [3.82] XXX sil24 cb=810037f0
cb_dma=37f0
Second:
Sep 30 20:06:22 treogen [3.82] XXX sil24 cb=810037f0
cb_dma=37f0
Sep 30 20:06:22 treogen [3.83] XXX sil24 cb=810037f1
cb_dma=37f1

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/27/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Torsten Kaiser wrote:
> > Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
> > I will try to look at the diff between these revisions some more, but
> > the change in sata_sil24.c looked like a perfect match for the
> > symptoms I was seeing.
>
> I think the first thing to do here is to verify 2.6.23-rc3-mm1 still
> works fine and my previous debug patch is pretty much meaningless if
> address initialization failure isn't the cause.

After the first trouble with -rc4-mm1 I switched back to -rc3-mm1. I
booted that kernel 7 times over 4 days and never had trouble. (Before
-rc4-mm1 came out, I used -rc3-mm1 for over a week)

So in case of -rc3-mm1 I'm pretty sure that it works.

Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
that 9 times, but from a quick look in /var/log/messages, I might not
have hit the "correct" situation to trigger the error.
That kernel is vanilla 2.6.23-rc7 plus the patch from
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
( http://marc.info/?l=linux-ide=119055574826083=2 )

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-09-27 Thread Tejun Heo

Torsten Kaiser wrote:
> Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
> I will try to look at the diff between these revisions some more, but
> the change in sata_sil24.c looked like a perfect match for the
> symptoms I was seeing.

I think the first thing to do here is to verify 2.6.23-rc3-mm1 still
works fine and my previous debug patch is pretty much meaningless if
address initialization failure isn't the cause.

Thanks.

-- 
tejun

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/27/07, Jeff Garzik <[EMAIL PROTECTED]> wrote:
> Torsten Kaiser wrote:
> > I compared the dmesg form good and bad boots with -rc7-mm1 but could
> > not see any difference, so do you think that these additional
> > diagnostics could show a difference?
> > Or could you suggest any other debugging options I should try?
>
> I think since its a reproducible problem, I think it's easiest to get
> you straight to git-bisect.  In this case, that would be

Sorry, but I don't think that will work.
It seems that I am able to reproduce the bug, but not reliable. And my
current best guess to make it happen involves the step "leaf the
computer powered off for 8 hours".
I estimate that even with the 8 hour pause only at ~50% of the boots
one drive fails. So I have no safe point to mark a bisect step as
'good'.

> a) start with a known good point (v2.6.22? v2.6.23?) and
> known bad point (HEAD, aka the most recent commit in
> libata-dev.git#upstream)

Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
I will try to look at the diff between these revisions some more, but
the change in sata_sil24.c looked like a perfect match for the
symptoms I was seeing.

What I just noticed, as I wanted two re-add the drive to the RAID:
This time it was not sda, but sdb that was kicked. But otherwise the
errors are perfectly identical.

I will try to make a 2.6.23-rc3.5-mm1 to narrow it down some more...

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-09-27 Thread Jeff Garzik

Torsten Kaiser wrote:

On 9/27/07, Tejun Heo <[EMAIL PROTECTED]> wrote:

Tejun Heo wrote:

Torsten Kaiser wrote:

Comparing the driver/ata directory from rc3-mm1 and rc4-mm1 the
following change looked the most suspicions to me:
http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=blobdiff;f=drivers/ata/sata_sil24.c;h=3dcb223117be9739ee04d70b6bfc776a4b839a3f;hp=e0cd31aa8002350add53ba6ff07493e503275244;hb=020bc1bd8d369a77bd9379cd9763ac0057651753;hpb=8d4bdf8087e682df98bdb856f6ad451bf6d597e7

That after rc4-mm1 the sata_sil24.c did not change anymore also
matches the occurrence of the error.

To confirm my theorie I exchanged the sata_sil24.c from rc8-mm1 with
the version from rc3-mm1.
I was able to boot the resulting kernel successfully 5 times, without
the error happening again.

Thanks a lot for chasing down the problem. The changed code is address
initialization path and it's weird that it causes intermittent failures,
not a consistent one.

Anyways, does the attached patch fix the problem?

I'm starting to *really* hate that bug.
My analysis was wrong, as I booted to modified 2.6.23-rc8-mm1 this
morning, that failed too. (Same error messages as -rc7-mm1 from the
first mail in this thread.)
So it's not that change that causes the breakage.

And I'm not really finding a good pattern to what boots fail and what work.
It seems to only fail, if I completely power off the system for
several hours. (Using the physical switch at the backside of the
powersupply, not the normal soft-off)

One of the five boots I tried yesterday, I also powered the system
completely off that way, but only leaving it off ~10..20 seconds
seemed not to trigger the bug.

But I still think that is not a hardware failure, as the -rc3-mm1
kernel never showed that error, even when I used it several times
after the first -rc4-mm1 failures.

If not, can you add printk of iomap[SIL24_PORT_BAR], offset, initialized
cmd_addr and scr_addr in the loop and see whether anything is different
between when the driver works and fails.

Should I do this anyway?

I compared the dmesg form good and bad boots with -rc7-mm1 but could
not see any difference, so do you think that these additional
diagnostics could show a difference?
Or could you suggest any other debugging options I should try?

I think since its a reproducible problem, I think it's easiest to get
you straight to git-bisect. In this case, that would be

1. Clone branch "upstream" of
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git

2. Test. If bug persists, you have narrowed down the problem to the -mm
changes from the SATA developers, that are to be sent for 2.6.24. If
the problem does not persist, then it's a problem added in the -mm
patchset alone, which carries few ATA patches outside of libata-dev.git.

3. If the problem is in libata-dev.git#upstream (likely), you can now
use git-bisect to find the specific commit that causes the problems.
Read the git-bisect man page for full details, but the basics are

a) start with a known good point (v2.6.22? v2.6.23?) and
known bad point (HEAD, aka the most recent commit in
libata-dev.git#upstream)

b) build and boot kernels, marking each as known-good or
known-bad.

c) This process will systematically narrow down the problem
to a single git commit.

Regards,

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/27/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Tejun Heo wrote:
> > Torsten Kaiser wrote:
> >> Comparing the driver/ata directory from rc3-mm1 and rc4-mm1 the
> >> following change looked the most suspicions to me:
> >> http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=blobdiff;f=drivers/ata/sata_sil24.c;h=3dcb223117be9739ee04d70b6bfc776a4b839a3f;hp=e0cd31aa8002350add53ba6ff07493e503275244;hb=020bc1bd8d369a77bd9379cd9763ac0057651753;hpb=8d4bdf8087e682df98bdb856f6ad451bf6d597e7
> >>
> >> That after rc4-mm1 the sata_sil24.c did not change anymore also
> >> matches the occurrence of the error.
> >>
> >> To confirm my theorie I exchanged the sata_sil24.c from rc8-mm1 with
> >> the version from rc3-mm1.
> >> I was able to boot the resulting kernel successfully 5 times, without
> >> the error happening again.
> >
> > Thanks a lot for chasing down the problem.  The changed code is address
> > initialization path and it's weird that it causes intermittent failures,
> > not a consistent one.
> >
> > Anyways, does the attached patch fix the problem?

I'm starting to *really* hate that bug.
My analysis was wrong, as I booted to modified 2.6.23-rc8-mm1 this
morning, that failed too. (Same error messages as -rc7-mm1 from the
first mail in this thread.)
So it's not that change that causes the breakage.

And I'm not really finding a good pattern to what boots fail and what work.
It seems to only fail, if I completely power off the system for
several hours. (Using the physical switch at the backside of the
powersupply, not the normal soft-off)

One of the five boots I tried yesterday, I also powered the system
completely off that way, but only leaving it off ~10..20 seconds
seemed not to trigger the bug.

But I still think that is not a hardware failure, as the -rc3-mm1
kernel never showed that error, even when I used it several times
after the first -rc4-mm1 failures.

> If not, can you add printk of iomap[SIL24_PORT_BAR], offset, initialized
> cmd_addr and scr_addr in the loop and see whether anything is different
> between when the driver works and fails.

Should I do this anyway?

I compared the dmesg form good and bad boots with -rc7-mm1 but could
not see any difference, so do you think that these additional
diagnostics could show a difference?
Or could you suggest any other debugging options I should try?

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/27/07, Tejun Heo [EMAIL PROTECTED] wrote:
Tejun Heo wrote:
Torsten Kaiser wrote:
Comparing the driver/ata directory from rc3-mm1 and rc4-mm1 the
following change looked the most suspicions to me:
http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=blobdiff;f=drivers/ata/sata_sil24.c;h=3dcb223117be9739ee04d70b6bfc776a4b839a3f;hp=e0cd31aa8002350add53ba6ff07493e503275244;hb=020bc1bd8d369a77bd9379cd9763ac0057651753;hpb=8d4bdf8087e682df98bdb856f6ad451bf6d597e7

That after rc4-mm1 the sata_sil24.c did not change anymore also
matches the occurrence of the error.

To confirm my theorie I exchanged the sata_sil24.c from rc8-mm1 with
the version from rc3-mm1.
I was able to boot the resulting kernel successfully 5 times, without
the error happening again.

Thanks a lot for chasing down the problem. The changed code is address
initialization path and it's weird that it causes intermittent failures,
not a consistent one.

Anyways, does the attached patch fix the problem?

One of the five boots I tried yesterday, I also powered the system
completely off that way, but only leaving it off ~10..20 seconds
seemed not to trigger the bug.

But I still think that is not a hardware failure, as the -rc3-mm1
kernel never showed that error, even when I used it several times
after the first -rc4-mm1 failures.

If not, can you add printk of iomap[SIL24_PORT_BAR], offset, initialized
cmd_addr and scr_addr in the loop and see whether anything is different
between when the driver works and fails.

Should I do this anyway?

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-09-27 Thread Jeff Garzik

Torsten Kaiser wrote:

On 9/27/07, Tejun Heo [EMAIL PROTECTED] wrote:

Tejun Heo wrote:

Torsten Kaiser wrote:

That after rc4-mm1 the sata_sil24.c did not change anymore also
matches the occurrence of the error.

To confirm my theorie I exchanged the sata_sil24.c from rc8-mm1 with
the version from rc3-mm1.
I was able to boot the resulting kernel successfully 5 times, without
the error happening again.

Thanks a lot for chasing down the problem. The changed code is address
initialization path and it's weird that it causes intermittent failures,
not a consistent one.

Anyways, does the attached patch fix the problem?

One of the five boots I tried yesterday, I also powered the system
completely off that way, but only leaving it off ~10..20 seconds
seemed not to trigger the bug.

But I still think that is not a hardware failure, as the -rc3-mm1
kernel never showed that error, even when I used it several times
after the first -rc4-mm1 failures.

If not, can you add printk of iomap[SIL24_PORT_BAR], offset, initialized
cmd_addr and scr_addr in the loop and see whether anything is different
between when the driver works and fails.

Should I do this anyway?

I think since its a reproducible problem, I think it's easiest to get
you straight to git-bisect. In this case, that would be

1. Clone branch upstream of
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git

a) start with a known good point (v2.6.22? v2.6.23?) and
known bad point (HEAD, aka the most recent commit in
libata-dev.git#upstream)

b) build and boot kernels, marking each as known-good or
known-bad.

c) This process will systematically narrow down the problem
to a single git commit.

Regards,

Jeff

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/27/07, Jeff Garzik [EMAIL PROTECTED] wrote:
 Torsten Kaiser wrote:
  I compared the dmesg form good and bad boots with -rc7-mm1 but could
  not see any difference, so do you think that these additional
  diagnostics could show a difference?
  Or could you suggest any other debugging options I should try?

 I think since its a reproducible problem, I think it's easiest to get
 you straight to git-bisect.  In this case, that would be

Sorry, but I don't think that will work.
It seems that I am able to reproduce the bug, but not reliable. And my
current best guess to make it happen involves the step leaf the
computer powered off for 8 hours.
I estimate that even with the 8 hour pause only at ~50% of the boots
one drive fails. So I have no safe point to mark a bisect step as
'good'.

 a) start with a known good point (v2.6.22? v2.6.23?) and
 known bad point (HEAD, aka the most recent commit in
 libata-dev.git#upstream)

Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
I will try to look at the diff between these revisions some more, but
the change in sata_sil24.c looked like a perfect match for the
symptoms I was seeing.

What I just noticed, as I wanted two re-add the drive to the RAID:
This time it was not sda, but sdb that was kicked. But otherwise the
errors are perfectly identical.

I will try to make a 2.6.23-rc3.5-mm1 to narrow it down some more...

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

2007-09-27 Thread Tejun Heo

Torsten Kaiser wrote:
 Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
 I will try to look at the diff between these revisions some more, but
 the change in sata_sil24.c looked like a perfect match for the
 symptoms I was seeing.

I think the first thing to do here is to verify 2.6.23-rc3-mm1 still
works fine and my previous debug patch is pretty much meaningless if
address initialization failure isn't the cause.

Thanks.

-- 
tejun

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/27/07, Tejun Heo [EMAIL PROTECTED] wrote:
 Torsten Kaiser wrote:
  Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
  I will try to look at the diff between these revisions some more, but
  the change in sata_sil24.c looked like a perfect match for the
  symptoms I was seeing.

 I think the first thing to do here is to verify 2.6.23-rc3-mm1 still
 works fine and my previous debug patch is pretty much meaningless if
 address initialization failure isn't the cause.

After the first trouble with -rc4-mm1 I switched back to -rc3-mm1. I
booted that kernel 7 times over 4 days and never had trouble. (Before
-rc4-mm1 came out, I used -rc3-mm1 for over a week)

So in case of -rc3-mm1 I'm pretty sure that it works.

Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
that 9 times, but from a quick look in /var/log/messages, I might not
have hit the correct situation to trigger the error.
That kernel is vanilla 2.6.23-rc7 plus the patch from
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
( http://marc.info/?l=linux-idem=119055574826083w=2 )

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

Tejun Heo wrote:
> Torsten Kaiser wrote:
>> Comparing the driver/ata directory from rc3-mm1 and rc4-mm1 the
>> following change looked the most suspicions to me:
>> http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=blobdiff;f=drivers/ata/sata_sil24.c;h=3dcb223117be9739ee04d70b6bfc776a4b839a3f;hp=e0cd31aa8002350add53ba6ff07493e503275244;hb=020bc1bd8d369a77bd9379cd9763ac0057651753;hpb=8d4bdf8087e682df98bdb856f6ad451bf6d597e7
>>
>> That after rc4-mm1 the sata_sil24.c did not change anymore also
>> matches the occurrence of the error.
>>
>> To confirm my theorie I exchanged the sata_sil24.c from rc8-mm1 with
>> the version from rc3-mm1.
>> I was able to boot the resulting kernel successfully 5 times, without
>> the error happening again.
> 
> Thanks a lot for chasing down the problem.  The changed code is address
> initialization path and it's weird that it causes intermittent failures,
> not a consistent one.
> 
> Anyways, does the attached patch fix the problem?

If not, can you add printk of iomap[SIL24_PORT_BAR], offset, initialized
cmd_addr and scr_addr in the loop and see whether anything is different
between when the driver works and fails.

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

Torsten Kaiser wrote:
> Comparing the driver/ata directory from rc3-mm1 and rc4-mm1 the
> following change looked the most suspicions to me:
> http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=blobdiff;f=drivers/ata/sata_sil24.c;h=3dcb223117be9739ee04d70b6bfc776a4b839a3f;hp=e0cd31aa8002350add53ba6ff07493e503275244;hb=020bc1bd8d369a77bd9379cd9763ac0057651753;hpb=8d4bdf8087e682df98bdb856f6ad451bf6d597e7
> 
> That after rc4-mm1 the sata_sil24.c did not change anymore also
> matches the occurrence of the error.
> 
> To confirm my theorie I exchanged the sata_sil24.c from rc8-mm1 with
> the version from rc3-mm1.
> I was able to boot the resulting kernel successfully 5 times, without
> the error happening again.

Thanks a lot for chasing down the problem.  The changed code is address
initialization path and it's weird that it causes intermittent failures,
not a consistent one.

Anyways, does the attached patch fix the problem?

-- 
tejun
diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
index 3831920..dc3ddcb 100644
--- a/drivers/ata/sata_sil24.c
+++ b/drivers/ata/sata_sil24.c
@@ -1117,6 +1117,7 @@ static int sil24_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
 
host->ports[i]->ioaddr.cmd_addr = port;
host->ports[i]->ioaddr.scr_addr = port + PORT_SCONTROL;
+   ata_std_ports(>ioaddr);
 
ata_port_pbar_desc(ap, SIL24_HOST_BAR, -1, "host");
ata_port_pbar_desc(ap, SIL24_PORT_BAR, offset, "port");

sata_sil24 broken since 2.6.23-rc4-mm1

2007-09-26 Thread Torsten Kaiser

As reported in the "2.6.23-rc4-mm1"-thread and the "What's in
linux-2.6-block.git for 2.6.24"-thread I'm having trouble that
sometimes on bootup one drive from the SiI-3132 throws errors and
becomes inaccesible.

The latest kernel I have seen this error was 2.6.23-rc7-mm1.
>From 7 boots 2 times the following happend:

Sep 25 07:42:11 treogen [   33.81] md1: bitmap initialized from
disk: read 10/10 pages, set 0 bits
Sep 25 07:42:11 treogen [   33.81] created bitmap (145 pages) for device md1
Sep 25 07:42:11 treogen [   63.91] ata1.00: exception Emask 0x0
SAct 0x1 SErr 0x0 action 0x6 frozen
Sep 25 07:42:11 treogen [   63.91] ata1.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 25 07:42:11 treogen [   63.91]  res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 25 07:42:11 treogen [   63.91] ata1.00: status: {DRDY }
Sep 25 07:42:11 treogen [   63.91] ata1: hard resetting link
Sep 25 07:42:11 treogen [   66.21] ata1: softreset failed (port not ready)
Sep 25 07:42:11 treogen [   66.21] ata1: reset failed (errno=-5),
retrying in 8 secs
Sep 25 07:42:11 treogen [   73.91] ata1: hard resetting link
Sep 25 07:42:11 treogen [   76.21] ata1: softreset failed (port not ready)
Sep 25 07:42:11 treogen [   76.21] ata1: reset failed (errno=-5),
retrying in 8 secs
Sep 25 07:42:11 treogen [   83.91] ata1: hard resetting link
Sep 25 07:42:11 treogen [   86.21] ata1: softreset failed (port not ready)
Sep 25 07:42:11 treogen [   86.21] ata1: reset failed (errno=-5),
retrying in 33 secs
Sep 25 07:42:11 treogen [  118.91] ata1: limiting SATA link speed
to 1.5 Gbps
Sep 25 07:42:11 treogen [  118.91] ata1: hard resetting link
Sep 25 07:42:11 treogen [  121.21] ata1: softreset failed (port not ready)
Sep 25 07:42:11 treogen [  121.21] ata1: reset failed, giving up
Sep 25 07:42:11 treogen [  121.21] ata1.00: disabled
Sep 25 07:42:11 treogen [  121.21] ata1: EH complete
Sep 25 07:42:11 treogen [  121.21] sd 0:0:0:0: [sda] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Sep 25 07:42:11 treogen [  121.21] end_request: I/O error, dev
sda, sector 625137161
Sep 25 07:42:11 treogen [  121.21] md: super_written gets
error=-5, uptodate=0
Sep 25 07:42:11 treogen [  121.21] raid5: Disk failure on sda2,
disabling device. Operation continuing on 2 devices

Comparing the driver/ata directory from rc3-mm1 and rc4-mm1 the
following change looked the most suspicions to me:
http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=blobdiff;f=drivers/ata/sata_sil24.c;h=3dcb223117be9739ee04d70b6bfc776a4b839a3f;hp=e0cd31aa8002350add53ba6ff07493e503275244;hb=020bc1bd8d369a77bd9379cd9763ac0057651753;hpb=8d4bdf8087e682df98bdb856f6ad451bf6d597e7

That after rc4-mm1 the sata_sil24.c did not change anymore also
matches the occurrence of the error.

To confirm my theorie I exchanged the sata_sil24.c from rc8-mm1 with
the version from rc3-mm1.
I was able to boot the resulting kernel successfully 5 times, without
the error happening again.

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

sata_sil24 broken since 2.6.23-rc4-mm1

2007-09-26 Thread Torsten Kaiser

As reported in the 2.6.23-rc4-mm1-thread and the What's in
linux-2.6-block.git for 2.6.24-thread I'm having trouble that
sometimes on bootup one drive from the SiI-3132 throws errors and
becomes inaccesible.

The latest kernel I have seen this error was 2.6.23-rc7-mm1.
From 7 boots 2 times the following happend:

Sep 25 07:42:11 treogen [   33.81] md1: bitmap initialized from
disk: read 10/10 pages, set 0 bits
Sep 25 07:42:11 treogen [   33.81] created bitmap (145 pages) for device md1
Sep 25 07:42:11 treogen [   63.91] ata1.00: exception Emask 0x0
SAct 0x1 SErr 0x0 action 0x6 frozen
Sep 25 07:42:11 treogen [   63.91] ata1.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 25 07:42:11 treogen [   63.91]  res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 25 07:42:11 treogen [   63.91] ata1.00: status: {DRDY }
Sep 25 07:42:11 treogen [   63.91] ata1: hard resetting link
Sep 25 07:42:11 treogen [   66.21] ata1: softreset failed (port not ready)
Sep 25 07:42:11 treogen [   66.21] ata1: reset failed (errno=-5),
retrying in 8 secs
Sep 25 07:42:11 treogen [   73.91] ata1: hard resetting link
Sep 25 07:42:11 treogen [   76.21] ata1: softreset failed (port not ready)
Sep 25 07:42:11 treogen [   76.21] ata1: reset failed (errno=-5),
retrying in 8 secs
Sep 25 07:42:11 treogen [   83.91] ata1: hard resetting link
Sep 25 07:42:11 treogen [   86.21] ata1: softreset failed (port not ready)
Sep 25 07:42:11 treogen [   86.21] ata1: reset failed (errno=-5),
retrying in 33 secs
Sep 25 07:42:11 treogen [  118.91] ata1: limiting SATA link speed
to 1.5 Gbps
Sep 25 07:42:11 treogen [  118.91] ata1: hard resetting link
Sep 25 07:42:11 treogen [  121.21] ata1: softreset failed (port not ready)
Sep 25 07:42:11 treogen [  121.21] ata1: reset failed, giving up
Sep 25 07:42:11 treogen [  121.21] ata1.00: disabled
Sep 25 07:42:11 treogen [  121.21] ata1: EH complete
Sep 25 07:42:11 treogen [  121.21] sd 0:0:0:0: [sda] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Sep 25 07:42:11 treogen [  121.21] end_request: I/O error, dev
sda, sector 625137161
Sep 25 07:42:11 treogen [  121.21] md: super_written gets
error=-5, uptodate=0
Sep 25 07:42:11 treogen [  121.21] raid5: Disk failure on sda2,
disabling device. Operation continuing on 2 devices

Comparing the driver/ata directory from rc3-mm1 and rc4-mm1 the
following change looked the most suspicions to me:
http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=blobdiff;f=drivers/ata/sata_sil24.c;h=3dcb223117be9739ee04d70b6bfc776a4b839a3f;hp=e0cd31aa8002350add53ba6ff07493e503275244;hb=020bc1bd8d369a77bd9379cd9763ac0057651753;hpb=8d4bdf8087e682df98bdb856f6ad451bf6d597e7

That after rc4-mm1 the sata_sil24.c did not change anymore also
matches the occurrence of the error.

To confirm my theorie I exchanged the sata_sil24.c from rc8-mm1 with
the version from rc3-mm1.
I was able to boot the resulting kernel successfully 5 times, without
the error happening again.

Torsten
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: sata_sil24 broken since 2.6.23-rc4-mm1

Torsten Kaiser wrote:
 Comparing the driver/ata directory from rc3-mm1 and rc4-mm1 the
 following change looked the most suspicions to me:
 http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=blobdiff;f=drivers/ata/sata_sil24.c;h=3dcb223117be9739ee04d70b6bfc776a4b839a3f;hp=e0cd31aa8002350add53ba6ff07493e503275244;hb=020bc1bd8d369a77bd9379cd9763ac0057651753;hpb=8d4bdf8087e682df98bdb856f6ad451bf6d597e7
 
 That after rc4-mm1 the sata_sil24.c did not change anymore also
 matches the occurrence of the error.
 
 To confirm my theorie I exchanged the sata_sil24.c from rc8-mm1 with
 the version from rc3-mm1.
 I was able to boot the resulting kernel successfully 5 times, without
 the error happening again.

Thanks a lot for chasing down the problem.  The changed code is address
initialization path and it's weird that it causes intermittent failures,
not a consistent one.

Anyways, does the attached patch fix the problem?

-- 
tejun
diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
index 3831920..dc3ddcb 100644
--- a/drivers/ata/sata_sil24.c
+++ b/drivers/ata/sata_sil24.c
@@ -1117,6 +1117,7 @@ static int sil24_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
 
host-ports[i]-ioaddr.cmd_addr = port;
host-ports[i]-ioaddr.scr_addr = port + PORT_SCONTROL;
+   ata_std_ports(ap-ioaddr);
 
ata_port_pbar_desc(ap, SIL24_HOST_BAR, -1, host);
ata_port_pbar_desc(ap, SIL24_PORT_BAR, offset, port);

Re: sata_sil24 broken since 2.6.23-rc4-mm1