On 10/11/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Torsten Kaiser wrote:
> > Looking closer at
> > http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
> > the change to libata.h seems bogus :
> >
> > in ata_qc_first_sg:
> > old                                new
> > return qc->__sg                    return qc->__sg
> > qc->__sg - qc->__sg == 0           qc->n_iter=0
> > -> sg - qc->__sg corresponds to qc->n_iter
> >
> > in ata_qc_next_sg:
> > sg++;                              sg_next(sg); qc->n_iter++;
> > sg - qc->__sg < qc->n_elem         qc->n_iter < qc->nelem
> > -> sg - qc->__sg corresponds to qc->n_iter
> >
> > but in ata_sg_is_last:
> > (sg - qc->__sg) +1 == qc->n_elem   qc->n_iter == qc->n_elem
> > if sg - qc->__sg corresponds to qc->n_iter then shoudn't it be
> > qc->n_iter+1 == qc->n_elem?
> >
> > That missing +1 would explain, why the SGE_TRM never gets set.
>
> Thanks a lot for tracking this down.  Does changing the above code fix
> your problem?

I did not try it.
I'm not an libata expert and while this change looks suspicios, I
can't be 100% sure if that change was intended.
And I did not want to experiment this deep in the code and risk
corrupting the hole drive.

> Jens, Torsten's analysis looks correct && depending on qc state (n_iter)
> during iteration doesn't look like a good idea.  Those iterators are not
> supposed to have side effects.  Would it be difficult to implement
> sg_last() test?
>
> > And it would fit the symptoms, that the boot would fail at random. If
> > the "correct" garbage was in place to where the sglist runs off it
> > hangs the drive.
> > And that would even fit the two different errors that I only got one time 
> > each:
> > * a completely illegal access (PCI master abort while fetching SGT)
> > * wrong alignment of the SGT (SGT no on qword boundary)
> > At that that times the garbage seemed to point invalid addresses.
> >
> > But I'm still not understanding, how the kernel could only fail
> > sometimes at bootup, but after that working without any visible
> > errors? Is the sil-chip rather intelligent about detecting corrupted
> > sglists and silently ignoring them?
>
> I have no idea why it fails only sometimes.

And that is, why I'm so unsure.
The error looks to serious to only cause random failures on one of two
drives on bootup.
I never had trouble with the remaining drive on the SiI-chip or both
drives if one got killed during booting.

I'm guessing that leaving the computer powered down long enough fills
the RAM with a special pattern that really hangs the drive, while
normaly it would just reject the invalid data. (I have ECC-RAM, does
this matter?)

Another guess might be that most of the time the Sil-chip correctly
terminates after the transfer-length is reached, even if SGE_TRM is
missing...

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to