"Reshetova, Elena" <elena.reshet...@intel.com> writes: >> "Reshetova, Elena" <elena.reshet...@intel.com> writes: >> >> >> "Reshetova, Elena" <elena.reshet...@intel.com> writes: >> >> >> >> 2>> Elena Reshetova <elena.reshet...@intel.com> writes: >> >> >> >> >> >> > refcount_t type and corresponding API should be >> >> >> > used instead of atomic_t when the variable is used as >> >> >> > a reference counter. This allows to avoid accidental >> >> >> > refcounter overflows that might lead to use-after-free >> >> >> > situations. >> >> >> >> >> >> In this patch you can see all of the uses of the count. >> >> >> What accidental refcount overflows are possible? >> >> > >> >> > Even if one can guarantee and prove that in the current implementation >> >> > there are no overflows possible, we can't say that for >> >> > sure for any future implementation. Bugs might always happen >> >> > unfortunately, but if we convert the refcounter to a safer >> >> > type we can be sure that overflows are not possible. >> >> > >> >> > Does it make sense to you? >> >> >> >> Not for code that is likely to remain unchanged for a decade no. >> > >> > Can we really be sure for any kernel code about this? And does it make >> > sense to trust our security on a fact like this? >> >> But refcount_t doesn't fix anything. At best it changes a bad bug to a >> less bad bug. So now my machine OOMS instead of allows a memory >> overwrite. It still doesn't work. > > Well, it is a step forward from security standpoint. OOMS is really hard > to exploit vs. memory overwrites. Pretty much all exploits need either > memory write or memory read, out of memory is really much harder to > exploit.
OOM in production is a denial of service attack. From a serverity point of view an OOM can be considered equivalent to a kernel panic and something that requires a box to reboot. >From a long term perspective I expect we will need to change all reference counters to a type where that is not saturating but instead fails to increment and returns an error. If we want to keep a system functioning in the face of maxing out a reference count that is the only way it can truly be done. >> Plus refcount_t does not provide any safety on the architectures where >> it is a noop. > > Not sure I understood this. What do you mean by "noop"? > refcount_t is currently architecture independent. noop being short for no operation. I believe there were some/most archicture implementations that define refcount_t to atomic_t out of performance concerns. I know I saw patches fly by to that effect. On an architecture where refcount_t is equivalent to atomic_t the change in these patches is a noop. >> >> This looks like a large set of unautomated changes without any real >> >> thought put into it. >> > >> > We are soon into the end of the first year that we started to look into >> > refcounter overflow/underflow problem and coming up this far was >> > not easy enough (just check all the millions of emails on kernel-hardening >> > mailing list). Each refcount_t conversion candidate was first found by >> > Coccinelle >> > analysis and then manually checked and converted. The story of >> > refcount_t API and all discussions go even further. >> > So you can't really claim that there is no " thought put into it " :) >> >> But the conversion of the instance happens without thought and manually. >> Which is a good recipe for typos. Which is what I am saying. >> >> There have been lots of conversions like that in the kernel and >> practically every one has introduced at least one typo. > > What do you exactly mean by "typo"? Typos should be detected at these > stages: An unintentional mistake. The term "thinko" might be more what I am thinking of. Many human mistakes are not slips of the fingers but accidentally giving the wrong command at some level. > 1) typos like wrong function name etc. can be found at compile time > (trust me I have found a number of these on the very first iteration with > patches) > 2) "typos" (not sure if it is correct to call them typos) like usage of wrong > refcount_t > API vs. original atomic_t API can be found during internal reviews or > reviews by maintainers This I worry about most as the mental distance from xxx_inc to xxx_dec can be very short. > 3) much bigger problem is actually not any typos, but hidden issues that show > up only > in run-time that detect underflows/overflows or inability to increment > from zero. > These only are nasty, but given that refcount_t WARNs left and right about > them, > we can detect them fast. Which means I worry about those less. > I don't know what is a better recipe for doing API changes like this? > Do you have any suggestions? I would think a semantic patch targeting a specific lock would be less error prone. I would think that the same semantic patch could be used from lock to lock with just a change of the lock that is being targeted. I strongly suspect that would reduce the chance of accident when dealing with a particular API and being scripted would increase the confidence in the changes. >> So from an engineering standpoint it is a very valid question to ask >> about. And I find the apparent insistence that you don't make typos >> very disturbing. >> >> > That almost always results in a typo somewhere >> >> that breaks things. >> >> >> >> So there is no benefit to the code, and a non-zero chance that there >> >> will be a typo breaking the code. >> > >> > The code is very active on issuing WARNs when anything goes wrong. >> > Using this feature we have not only found errors in conversions, but >> > sometimes errors in code itself. So, any bug would be actually much >> > faster visible than using old atomic_t interface. >> > >> > In addition by default refcount_t equals to atomic, which also gives a >> > possibility to make a softer transition and catch all related bugs in >> > couple >> > of cycles when enabling CONFIG_REFCOUNT_FULL. >> >> But if you make a typo and change one operation for another I don't see >> how any of that applies. > > It is hard to make a "typo" to change one operation to another. It is not a > one-two char mismatch error. Those might be more properly "thinkos" but they do happen. > When doing these patches we followed the > logic of "less code changes - better" (since less chances of making mistake), > so if in some cases functions are changed (like from atomic_sub to > refcount_sub_and_test(), or from atomic_inc_not_zero() to atomic_inc() etc.) > there was a reason for making it and the change wasn't automatic and without > thinking at all. Again, we do have our maintainers also to catch if a change > that > we did doesn't actually work for them right? In general a maintainers job is make certain the appropriate code review and due dilligence has happened not nececessarily to perform that code review themselves. Changing from atomoic_inc_not_zero to a simple refcount_inc really troubles me because it makes it harder to switch to something fully correct. Eric