On 8/20/2025 4:51 PM, Marek Marczykowski-Górecki wrote:
On Wed, Aug 20, 2025 at 04:26:14PM +0300, Timo Teras wrote:
On Wed, 20 Aug 2025 15:38:12 +0300
"Lifshits, Vitaly" <[email protected]> wrote:
On 8/20/2025 9:57 AM, Timo Teras wrote:
Thanks for adding this!
However, as a user, I find it inconvenient if the default setting
results in a subtly broken system on a device I just bought from a store.
Since this affects devices from multiple large vendors, would it
be possible to add some kind of quirk mechanism to automatically
enable this on known "bad" systems? Perhaps something based on
DMI or other system-specific information. Could something like
this be implemented?
At least in my use case I have multiple e1000e-using laptops working on
the same link partner, and only one broken device, for which I reported
this issue. So in my experience the issue primarily relates to a
specific system (perhaps also requiring a specific link partner for the
issue to show up).
Unfortunately, there is no visible configuration that allows the
driver to reliably identify problematic systems.
If in the future we find such data, then we can improve the
workaround and make it automatic.
At present, the user-controlled interface is the best we have.
Could you look at:
- drivers/hid/i2c-hid/i2c-hid-dmi-quirks.c
- drivers/soundwire/dmi-quirks.c
These use dmi_first_match() to match the DMI information of the
system and then apply quirks based on the matching per-system data.
Having a similar mechanism in e1000e should be possible, right?
I am happy to provide the needed DMI information from my system if
this works out.
Timo
Hi Timo,
At the moment, we have no clear knowledge as to which systems may be
affected or what common characteristics they share.
We are working with vendors to try to narrow it down.
You are most welcome to share DMI information from your system. It
can help with further investigation.
However, maintaining a DMI quirk for every single system for which an
issue has been reported is not feasible. Trying to deduce a pattern
from a handful of data points can lead to it being too broad or too
narrow. Furthermore, it may set up expectations of updating the quirk
every time another user comes and says 'your default setting does not
work for me'. This can quickly escalate out of control, and generally
seems like the wrong approach.
Ultimately, vendors are best positioned to manage this, as they know
which of their systems require this parameter. If a list were to be
maintained, I’d suggest something similar to what Mario proposed for
Dell platforms a few years ago for a different issue:
https://patchwork.ozlabs.org/project/netdev/patch/[email protected]/
For now, I prefer not to delay the current patch, acknowledging that
finding a better solution may take time.
Thank you for the continued investigation on the issue!
But I find that this commit does not fix the reported regression:
nothing changes without additional admin/user intervention. Things used
to work, and the added/modified K1 support is causing a regression.
Ubuntu has already reverted the offending patch due to complaints in
some flavors:
https://patchwork.ozlabs.org/project/ubuntu-kernel/patch/[email protected]/
https://bugs.launchpad.net/bugs/2115393
https://www.mail-archive.com/[email protected]/msg551129.html
Qubes OS also has this change reverted in default kernel, for the same
reason:
https://github.com/QubesOS/qubes-issues/issues/9896
https://github.com/QubesOS/qubes-linux-kernel/commit/4fb8c96dd7bd73dda00a89d026b6ebefff939a67
We've got several reports of the regression caused by the "e1000e:
change k1 configuration on MTP and later platforms" commit, and _none_
of the reporters complains after reverting it. And we do have many
users on MTL or newer.
This is what I also ended up doing, as it reliably fixes things on
every model I have and has not caused any other issues on any of them
(including packet loss).
At least mainstream Dell Pro and HP Zbook laptops have been reported to
be broken. See:
https://lists.openwall.net/netdev/2025/07/01/57
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20250623/048860.html
This seems to be the same issue:
https://bugzilla.kernel.org/show_bug.cgi?id=218642
So some questions at this point:
If the added K1 configuration does not work and causes regressions,
could it be reverted and added back once a K1 configuration change that
can determine the affected systems is ready?
Could you explain the commit "e1000e: change k1 configuration on MTP
and later platforms" more? What does it fix? My understanding is that
it addresses "minor packet loss that may affect some machines"?
How many machines / what kind of scenario is affected? Is it fixing a
more serious issue than the regression it is causing?
The regression is completely defunct ethernet after unplugging the cable.
My understanding is that the K1 change affects only power consumption.
Is this right? How much is the consumption difference? Would it rather
make sense to disable K1 by default on the potentially affected mac/phy
versions until a good common denominator is found?
Given the severity of the regression, I'd suggest something like the
above. Have functional configuration by default, and have an option to
potentially improve power consumption. Once criteria when it can be
safely enabled by default are figured out, then it's fine to apply the
improvement by default. But I'd rather have users with functional
ethernet than a slight power (or performance?) improvement at the cost
of completely breaking it for others...
On the other hand, do you think that asking to maintain a list of the
few currently known affected machines (until a simpler common
denominator can be found) is too unreasonable? If the list seems to
grow much, it
would be an indication that the default setting is wrong and changing
the defaults might be a good idea.
Let me know what info you'd need for such list.
Hi,
Thanks for your input — I think your points are valid.
As I mentioned earlier, fully reverting the patch "e1000e: change K1
configuration on MTP and later platforms" is not advisable.
In addition to the possibility of packet corruption, it also increases the
risk of PHY access failures, which can lead to the driver sporadically
failing probe() or resume() flows.
However, surveying the breadth of the systems suffering regressions with
this patch, it seems that a safer approach is indeed to have the
"disable K1" flag default to TRUE on MTP and later platforms.
This approach ensures:
- No impact on legacy platforms.
- Affected platforms are protected both from the PHY access failures
  resolved by the previous patch and from the packet loss issues it
  introduced, at the cost of only a somewhat increased power
  consumption.
I will send a V2 with this change.