On Tue, 18 Oct 2011 11:37:38 -0200, Eugeni Dodonov wrote:
> On 10/18/2011 11:14, Jean Delvare wrote:
> > Hi Dave, Eugeni, Alex,
> >
> > On Tue, 18 Oct 2011 11:02:00 +0100, Dave Airlie wrote:
> >>> This allows to avoid talking to a non-existent bus repeatedly until we
> >>> finally timeout. The non-existent bus is signalled by -ENXIO error,
> >>> provided by i2c_algo_bit:bit_doAddress call.
> >>>
> >>> As the advantage of such change, all the other routines which use
> >>> drm_get_edid would benefit for this timeout.
> >>>
> >>> This change should fix
> >>> https://bugs.freedesktop.org/show_bug.cgi?id=41059 and improve overall
> >>> edid detection timing by 10-30% in most cases, and by a much larger margin
> >>> in case of phantom outputs.
> >>
> >> Jean, Alex,
> >>
> >> I'm thinking of thowing this into -next, any objections?
> >
> > This seems to be the wrong place for the test. The code below is really
> > testing for non-responsive (possibly not present) EDID, NOT
> > "non-existent adapter". Non-existent adapter should be checked in the
> > firmware if possible, or failing that, by testing the bus lines at bus
> > creation time, as i2c_algo_bit.bit_test=1 is doing. While this setting
> > has been known to cause trouble recently (not per se but because it was
> > triggering a bug somewhere else in the radeon driver), it might still
> > have value, and could be changed to a per-adapter setting by exporting
> > the test function (I have a patch ready doing exactly this) and letting
> > video drivers test their I2C buses and discard the non-working ones if
> > they want.
> >
> > I am worried that the patch below will cause regressions on other
> > machines. There's a comment right before the loop which reads:
> >
> >     /* The core i2c driver will automatically retry the transfer if the
> >      * adapter reports EAGAIN. However, we find that bit-banging transfers
> >      * are susceptible to errors under a heavily loaded machine and
> >      * generate spurious NAKs and timeouts. Retrying the transfer
> >      * of the individual block a few times seems to overcome this.
> >      */
> >
> > So the retries are there for a reason, and -ENXIO is exactly what you
> > get on spurious NAKs.
> 
> You are right about the bit_test=1 testing, my initial attempt at the
> fix did exactly that
> (https://bugs.freedesktop.org/show_bug.cgi?id=41059, comments 14 and 15).
> 
> The problem is that for some buses, namely HDMI ones in my testing,
> bit_test-like tests always consider them as non-existent when the
> connectivity is not setup; and as working when it is.

Just to clarify: by "connectivity is setup", do you mean code in the
driver setting the GPIO pin direction etc., or a display being
connected to the graphics card?

I admit I am a little surprised. I2C buses should have their lines up
by default, thanks to pull-up resistors, and masters and slaves should
merely drive the lines low as needed. The absence of slaves should have
no impact on the line behavior. If the Intel graphics chips (or the
driver) rely on the display to hold the lines high, I'd say this is a
seriously broken design.

As a side note, I thought that HDMI had the capability of presence
detection, so there may be a better way to know if a display is
connected or not, and this information could used to dynamically
instantiate and delete the I2C buses? That way we could skip probing
for EDID when there is no chance of success.

In the specific case of the user report that started this discussion,
the card has no HDMI port to start with, so it seems weird to even
attempt to create a DDC bus for HDMI. If there no way the i915 driver
can detect this case and skip the whole HDMI setup? As I understand it
the radeon driver is able to do that. If the user report is an isolated
case of faulty BIOS/firmware, you could consider handling it
specifically (as the radeon driver does, too.)

> So in any case, we
> could not just blacklist them - when they do, they are gone for good,
> and won't work anymore, and we need to keep re-trying them at each attempt.
> 
> And in case of continuous pre-testing for the stuck bits and like when
> driver attempts to get the edid (for example, when xrandr is run), we
> still hit the same issue - the drm_do_probe_ddc_edid will continue to
> retry the attempts until it reaches the maximum number of retries. E.g., there
> is no way to tell drm_do_probe_ddc_edid to treat any return code as a
> permanent failure and make it give up faster.

Well, you could always do manual line testing of the I2C bus _before_
calling drm_do_probe_ddc_edid()? And skip the call if the test fails
(i.e. the bus isn't ready for use.) As said before, I am willing to
export bit_test if it helps any. I've attached a patch doing exactly
this. Let me know if you want me to commit it.

Your proposed patch is better than I first thought. I had forgotten
that i2c-algo-bit tries up to 3 times to get a first ack from a slave.
So if i2c_transfer returns -ENXIO, this means that i2c-algo-bit has
already attempted 3 times to contact the slave, with no reply, so
there's probably no point going on. A communication problem with a
present slave would have returned a different error code.

Without your patch, a missing slave would cause 3 * 5 = 15 retries,
which is definitely too much.

That being said, even then the whole probe sequence shouldn't exceed
10 ms, which I wouldn't expect a user to notice. The user-reported 4
second delay when running xrandr can't be caused by this. 4 seconds for
15 attempts is 250 ms per attempt, this looks like a timeout due to
non-functional bus, not a nack. Not that 250 ms is 1 jiffy for HZ=250,
which I guess was what the reporting user was running. So even with
your patch, there's still 750 ms to wait, I don't think this is
acceptable. You really have to fix that I2C bus or skip probing it.

> It could be fixed either in per-driver fashion, like I did with the other 
> patch
> (which also tests for -ENXIO); or in a generic way, in DRM. I couldn't 
> reproduce
> any false positives coming from -ENXIO on i915 driver, but perhaps it is
> easier with radeon? Do you have any specific workload which trick the
> driver into generating spurious NAKs by a chance?

I'm not sure if it is related to the driver, graphics chip or display.
The way I read the comment in the code, the cause of the problem would
rather be CPU load. And as a matter of fact, in bit-banging mode, the
I2C clock is generated by the CPU itself, and we have no guarantee
that it can be done on time. For example if interrupts fire during the
I2C transfer and they take time to be processed, we might exceed the
standard time constraints and slaves might consider that the transfer
is aborted. This was discussed before on the linux-i2c list [1], but no
solution was found yet and I'm not sure if such a solution exists. If
anyone has ideas, they are welcome.

[1] http://marc.info/?l=linux-i2c&m=129250841025737&w=2

I don't think I ever hit the problem (but with the retry code in place,
that's hard to tell for sure.) Best would be to ask the developers
involved in 4819d2e4310796c4e9eef674499af9b9caf36b5a, which added the
retry mechanism: Chris Wilson and Michael Buesch (both Cc'd.) Chris,
Michael, do you know of ways to reproduce the issue? Can you please
also confirm that the error code you were receiving was not -ENXIO?

Note that the problem is more likely to happen with slow masters,
because (1) every transaction takes longer and thus has a higher
probability to be hit by interrupts and (2) long delays mean smaller
margins to the spec limits, so interrupts are more likely to cause
trouble. I see that both radeon_i2c and intel_i2c set udelay to 20 ?s,
which means a 25 kbps I2C bus. In general it is recommended to not
drive the I2C bus below 10 kbps (that's even a hard limit for SMBus
compliance.) nouveau_i2c is even worse with udelay = 40 ?s which is
equivalent to a 12.5 kbps I2C bus, very close to the low limit.

I would recommend lowering udelay to at least 10 ?s (50 kbps I2C bus).
It might even make sense to lower even more, maybe down to 5 ?s to hit
the max speed of standard I2C (100 kbps). Compliant slaves can slow
down the clock on the fly if they need more time (but I wouldn't expect
EEPROMs to need time as they don't have anything to process.) Not only
this may help avoid the problems retries attempt to work around, but it
will also make EDID block read faster (both on success and failure). At
25 kbps, reading a 128-byte EDID block takes about 50 ms. This could be
lowered to 25 ms with udelay = 10, or even 12 ms if driving the I2C
clock at max speed.

FWIW, framebuffer drivers radeon, cyber2000, i810, matrox, s3, savage,
tdfx and via all have udelay = 10 or lower, so it can't be that bad.
I'll submit a patch for the 3 KMS DRM drivers.

> > One thing which may make sense would be to make the retry count a
> > module parameter, instead of a hard-coded value, so that users can
> > lower the value if they care more about boot time than reliability. But
> > again, ideally problematic buses would not have been created in the
> > first place.
> 
> Or perhaps it would be possible to have any error code coming from
> i2c_transfer to signal a permanent error, which should not be retried..
> what do you think?

i2c_transfer doesn't do anything by itself, it delegates all the work
to the i2c bus driver or algorithm (in this case i2c-algo-bit and
specifically bit_xfer() in this module.) Unfortunately there is no way
to tell a permanent error from a transient one. What we can do OTOH is
differentiate between the different error types and have adjusted retry
strategies. As your patch does, and i2c-algo-bit too.

I see a number of issues in i2c-algo-bit as I read through the code.
First issue, it doesn't handle multi-master setups, i.e. no busy bus
detection and no arbitration loss detection. I don't know if
multi-master I2C topologies are possible on graphics cards?

Second issue, it returns error codes (-EREMOTEIO) not documented in
Documentation/i2c/fault-codes. These should be converted to comply.
I'll look into it.

-- 
Jean Delvare

Reply via email to