On Thu, 2023-10-26 at 21:38 +0200, Theo Buehler wrote:
> On Thu, Oct 26, 2023 at 11:51:00AM +0200, Martijn van Duren wrote:
> > This case is covered by the new regress' backend_get_toofew and
> > backend_get_toomany tests. However, even with MALLOC_OPTIONS cranked
> > to the max it's really hard to trigger (I had to run
> > backend_get_wrongorder, backend_get_toofew, backend_get_toomany
> > sequentially in a tight loop killing snmpd between iterations for the
> > best chance).
> > 
> > When we receive an invalid varbindlist in a response we set the invalid
> > variable. This in turn calls appl_varbind_error(), but the avi_state
> > of the varbinds remains in APPL_VBSTATE_PENDING. Directly following
> > we call appl_request_downstream_free(), which sees that the varbinds
> > haven't been resolved, triggering a call to
> > appl_request_upstream_resolve(). This call in turn sees that the
> > error has been set and just sends out the error-response to the client
> > and frees the appl_request_upstream. From here we return back to
> > appl_response(), which also calls appl_request_upstream_resolve(),
> > resulting in a use after free.
> > 
> > The main tool for fixing this issue is making use of
> > appl_request_upstream's aru_locked member, which will cause
> > appl_request_upstream_resolve() to return instantly. The simplest fix is
> > to set aru_locked before calling appl_request_downstream_free() and
> > unsetting it directly afterwards inside appl_response().
> > 
> > The second one is the diff proposed below, which shrinks the code.
> > 
> > appl_request_upstream_free() is only called once from
> > appl_request_upstream_reply(). appl_request_upstream_reply() in turn
> > is only called by appl_request_upstream_resolve().
> > appl_request_upstream_resolve() is called in 3 places:
> > - appl_processpdu(): to kick things off
> > - appl_request_downstream_free(): For when a backend disappears with
> >     outstanding requests
> > - appl_response(): To kickstart the next round of resolving.
> > 
> > Since appl_request_downstream_free() is always called from
> > appl_response(), we can leverage that function and make it call
> > appl_request_upstream_resolve() unconditionally.
> > 
> > appl_request_downstream_free() is called from the following locations:
> > - appl_close(): When a backend has disappeared.
> > - appl_request_upstream_free(): We send out a reply early, because an
> >     error has been detected.
> > - appl_response(): We received a response
> > 
> > appl_request_upstream_free() can't reenter into
> > appl_request_upstream_resolve(), or it would potentially trigger new
> > appl_request_downstreams. This can be prevented by setting aru_locked
> > before calling appl_request_downstream_free().
> > For all other cases we should rely on appl_request_upstream_resolve()'s
> > logic to handle varbinds in any state, so there's no reason make calls
> > from other contexts conditional.
> 
> Your description of the bug makes sense and your choice of resolving it
> as well. Thanks for the in-depth explanation, that helped a lot. Still,
> I must say that I don't really feel at ease with the amount of complexity
> and entanglement here. I simply can't fit all of this into my head within
> a reasonable amount of time.
> 
Well, the S in SNMP doesn't stand for "I want to pull my hair out" for
no reason.

So for people interested, this is the high-over flow of the code:
When a message is received snmpe.c sends the PDU part to
appl_processpdu(). This function pulls apart the request in structs that
can be worked with internally (1 struct appl_request_upstream + n
struct appl_varbind_internal).
After things are pulled apart appl_processpdu() calls
appl_request_upstream_resolve(), which is responsible for retrieving
the values for the requested varbinds. It does this by going over all
varbinds in the APPL_VBSTATE_NEW state, and calling
appl_varbind_backend() to find a matching backend for the request. In
the case of a getnext/getbulk request it can also increment the oid
to the next matching region (if needed) and it also fills the end oid
where the ownership of the backend ends. After all APPL_VBSTATE_NEW
varbinds have found their backend, they are grouped together by backend
in a struct appl_request_downstream and send out via
appl_request_downstream_send().

Once a backend has a response available it calls appl_response(). This
function verifies the returned data via appl_varbind_valid() and
appl_error_valid(), and checks if the data matches the supported
datatypes (e.g. counter64 on SNMPv1). After that the data is put into
the corresponding struct appl_varbind_internal, or in case of an EOMV
we increment the OID to the end OID of the last request and reset the
varbind to APPL_VBSTATE_NEW. After that the struct
appl_request_downstream is freed and we return back to
appl_request_upstream_resolve(). If all varbinds have reached
APPL_VBSTATE_DONE (or an error occured) we call
appl_request_upstream_reply() which creates a response-pdu and hands
it back over to snmpe.c to be returned to the requester.

Some of the trickier parts come from that a backend can call back into
appl_response() both before returning from their get* function, and
at any other arbitrary point later in time. The legacy backend has all
the data available straight away and calls back before returning from
the get*(), while the agentx backend sends out the request and we wait
until we get an answer on the socket, which is always after we've
returned from the get*(). The legacy backend is the reason for the
aru_locked variable, since we don't want more than one
appl_request_upstream_resolve() running per struct
appl_request_upstream.
Another issue is that a backend can disappear before a response is
returned. This is handled via appl_close(), which first removes all
registered regions, followed by invalidating any outstanding struct
appl_request_downstream, which triggers a next round of
appl_request_upstream_resolve().

Hope this helps to digest the insanity a little better.

Reply via email to