On 2017-10-12 16:42, Guenter Roeck wrote:
> On 10/12/2017 07:32 AM, Peter Rosin wrote:
>> On 2017-10-12 13:35, Peter Rosin wrote:
>>> Hi!
>>>
>>> I have encountered an "interesting" bug. It silently corrupts data
>>> and is generally nasty...
>>>
>>> On an I2C bus, driven by the at91 driver and DMA (an Atmel
>>> sama5d31 chip), I have a 256 byte eeprom (NXP SE97BTP). I'm using
>>> Linux v4.13.
>>>
>>> The at24 driver for the eeprom detects that the I2C adapter is
>>> capable of I2C transactions and selects that over SMBus. Reads
>>> are done in 128 byte chunks. However, sometimes there is some
>>> kind of event that disturbs the transactions such that the very
>>> last bit of the very last byte (and the following NACK and STOP)
>>> of such chunks are delayed for a long time (the latest incident
>>> shows 85ms on the scope). That is too long for the eeprom which
>>> is expecting SMBus and times out after about 30 ms. When the
>>> eeprom times out, it just releases the data line so that it is
>>> pulled up high. The I2C driver does not notice this, and when it
>>> finally gets going, it reads a one for the last bit instead of
>>> the expected zero. Since it is the last byte of the read, a NACK
>>> is expected and since the eeprom has timed out the NACK is there.
>>> And the STOP condition also looks normal (expected since it is
>>> generated by the driver itself). So, the driver has not noticed
>>> anything funny. But the data is corrupted.
>>>
>>> I can work around this by disabling the SMBus timeout in the eeprom
>>> with:
>>>
>>>     i2cset -f 0 0x18 0x22 0x8100
>>>
>>> But that is done on a different I2C address (the eeprom is on
>>> address 0x50), since the chip is a combined temperature sensor and
>>> eeprom, and the SMBus timeout bit is of course in a temperature
>>> sensor register.
>>>
>>> HOWEVER, I fail to see how this is limited to my case with this
>>> eeprom. Any SMBus chip with a timeout will suffer the same fate.
>>> The real bug is that this happens without the driver noticing it.
>>> And why is there an 85ms delay in the middle of the last byte?
>>> Sure, I can see why there might be a delay before finishing up
>>> with a STOP condition or between bytes if there needs to be some
>>> DMA setup at some interval, but after the 7th bit of a byte?
>>>
>>> For a lot of transactions on the I2C bus there is no delay before
>>> the last bit. And most of the time there is no delay for the
>>> eeprom reads either; the delay only occurs when it feels like it.
>>>
>>> This does not feel good at all.
>>
>> I added some traces to i2c-at91.c and, AFAIU, it's the call to
>> at91_twi_read_data_dma_callback that sometimes arrives later than
>> desired. Once the callback runs, the transfer completes swiftly.
>>
>> After reading the comments in that driver I suppose the HW holds
>> on to the last data-bit until it knows whether to ACK or NACK in
>> the following bit.
>>
>> But given this, it is questionable if this driver/HW combo can
>> claim support for SMBus. But then again, I expect many things
>> suffer from similar scheduling delays (presumably that's what's
>> going on) so this is probably not special to i2c-at91.c...
>>
>> Since this is probably a very generic problem and I just happened
>> to hit it for the eeprom, I wonder if it would be ok to add a
>> workaround, as below, to the temperature sensor driver part of this
>> chip?
>> (with suitable comments, defines for the constants etc -
>> setting the 0x0080 bit in reg 0x22 disables the SMBus timeout)
>>
>> Cheers,
>> Peter
>>
>> diff --git a/drivers/hwmon/jc42.c b/drivers/hwmon/jc42.c
>> index 1bf22eff0b08..3e72bd8e06d1 100644
>> --- a/drivers/hwmon/jc42.c
>> +++ b/drivers/hwmon/jc42.c
>> @@ -416,6 +416,13 @@ static int jc42_detect(struct i2c_client *client, struct i2c_board_info *info)
>>  	if ((cap & 0xff00) || (config & 0xf800))
>>  		return -ENODEV;
>>
>> +	if (manid == NXP_MANID && (devid & SE97_DEVID_MASK) == SE97_DEVID) {
>> +		int smbus = i2c_smbus_read_word_swapped(client, 0x22);
>> +		if (smbus < 0)
>> +			return -ENODEV;
>> +		i2c_smbus_write_word_swapped(client, 0x22, smbus | 0x0080);
>> +	}
>> +
>
> Ouch. Not like that; this would affect every board with this chip, not just
> this one.

That was kind of the intention. Silent corruption is nasty. But ok, let's
do the opt-in thing.

> We would need something like a DT property to do that (smbus-timeout-disable
> is used in other drivers).
>
> .. and definitely not in the detect function. This would have to be done in
> probe.

Arrgh, I did test it, but not properly. The patch passed my tests because
I didn't power-cycle the board and the already disabled timeout stayed in
place for the next boot. I just assumed jc42_detect was called from probe
and it was convenient to have the manufacturer etc already at hand. Silly
silly.

I'll send a proper patch tomorrow with these things taken care of...

Cheers,
peda
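
For illustration only, here is a rough userspace sketch of the opt-in
read-modify-write discussed above. It is not the promised patch. Register
0x22 and bit 0x0080 come from this thread; everything else (struct fake_chip,
the fake_* accessors, probe_disable_smbus_timeout and its opt_in flag, which
stands in for a DT property such as smbus-timeout-disable) is hypothetical
and only mimics what a probe-time helper might do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SE97_REG_SMBUS         0x22    /* register number taken from the thread */
#define SE97_SMBUS_TIMEOUT_DIS 0x0080  /* "disable timeout" bit from the thread */

/* Hypothetical stand-in for the chip's 16-bit registers; not a kernel type. */
struct fake_chip {
	uint16_t regs[256];
	bool fail_io;               /* set to simulate a bus error */
};

/* Hypothetical accessor: register value (>= 0) or a negative error. */
static int fake_read_word(struct fake_chip *chip, uint8_t reg)
{
	if (chip->fail_io)
		return -1;
	return chip->regs[reg];
}

/* Hypothetical accessor: 0 on success or a negative error. */
static int fake_write_word(struct fake_chip *chip, uint8_t reg, uint16_t val)
{
	if (chip->fail_io)
		return -1;
	chip->regs[reg] = val;
	return 0;
}

/*
 * Probe-time helper (assumed shape): leave the chip alone unless the
 * board opted in, and preserve the other bits of the register.
 */
static int probe_disable_smbus_timeout(struct fake_chip *chip, bool opt_in)
{
	int val;

	if (!opt_in)
		return 0;

	val = fake_read_word(chip, SE97_REG_SMBUS);
	if (val < 0)
		return val;

	if (val & SE97_SMBUS_TIMEOUT_DIS)
		return 0;           /* already disabled, nothing to write */

	return fake_write_word(chip, SE97_REG_SMBUS,
			       (uint16_t)val | SE97_SMBUS_TIMEOUT_DIS);
}

/* Usage example: exercises the opt-out, opt-in and error paths. */
static int example(void)
{
	struct fake_chip chip = { .regs = { [SE97_REG_SMBUS] = 0x0001 } };

	assert(probe_disable_smbus_timeout(&chip, false) == 0);
	assert(chip.regs[SE97_REG_SMBUS] == 0x0001);   /* untouched */

	assert(probe_disable_smbus_timeout(&chip, true) == 0);
	assert(chip.regs[SE97_REG_SMBUS] == 0x0081);   /* bit set, rest kept */

	chip.fail_io = true;
	assert(probe_disable_smbus_timeout(&chip, true) < 0);
	return 0;
}
```

The point of the opt_in gate is that boards which do not ask for it see no
register write at all, which addresses the "every board with this chip"
objection. A real driver would use the i2c_smbus_*_word_swapped helpers shown
in the diff instead of the fake accessors.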