route for large read buffer

Alexander Duyck Fri, 04 Nov 2016 12:13:46 -0700

On Fri, Nov 4, 2016 at 12:07 PM, Jason Baron <jba...@akamai.com> wrote:
>
>
> On 11/04/2016 02:43 PM, Alexander Duyck wrote:
>>
>> On Fri, Nov 4, 2016 at 7:45 AM, Jason Baron <jba...@akamai.com> wrote:
>>>
>>> From: Jason Baron <jba...@akamai.com>
>>>
>>> When read() is called on /proc/net/route requesting a size that is one
>>> entry size (128 bytes) less than m->size or greater, the resulting output
>>> has missing and/or duplicate entries. Since m->size is typically
>>> PAGE_SIZE,
>>> for a PAGE_SIZE of 4,096 this means that reads requesting more than 3,968
>>> bytes will see bogus output.
>>>
>>> For example:
>>>
>>> for i in {100..200}; do
>>>         ip route add 192.168.1.$i dev eth0
>>> done
>>> dd if=/proc/net/route of=/tmp/good bs=1024
>>> dd if=/proc/net/route of=/tmp/bad bs=4096
>>>
>>> # diff -q /tmp/good /tmp/bad
>>> Files /tmp/good and /tmp/bad differ
>>>
>>> I think this has gone unnoticed, since the output of 'netstat -r' and
>>> 'route' is generated by reading in 1,024 byte increments and thus not
>>> corrupted. Further, the number of entries in the route table needs to be
>>> sufficiently large in order to trigger the problematic case.
>>>
>>> The issue arises because fib_route_get_idx() does not properly handle
>>> the case where pos equals iter->pos. This case only arises when we have
>>> a large read buffer size because we end up re-requesting the last entry
>>> that overflowed m->buf. In the case of a smaller read buffer size,
>>> we don't exceed the size of m->buf, and thus fib_route_get_idx() is
>>> called
>>> with pos greater than iter->pos.
>>>
>>> Fix by properly handling the iter->pos == pos case.
>>>
>>> Fixes: 25b97c016b26 ("ipv4: off-by-one in continuation handling in
>>> /proc/net/route")
>>> Cc: Andy Whitcroft <a...@canonical.com>
>>> Cc: Alexander Duyck <alexander.h.du...@intel.com>
>>> Signed-off-by: Jason Baron <jba...@akamai.com>
>>> ---
>>>  net/ipv4/fib_trie.c | 12 ++++++++++--
>>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>>> index 31cef3602585..1017533fc75c 100644
>>> --- a/net/ipv4/fib_trie.c
>>> +++ b/net/ipv4/fib_trie.c
>>> @@ -2411,12 +2411,17 @@ static struct key_vector
>>> *fib_route_get_idx(struct fib_route_iter *iter,
>>>                                             loff_t pos)
>>>  {
>>>         struct key_vector *l, **tp = &iter->tnode;
>>> +       loff_t saved_pos = 0;
>>>         t_key key;
>>>
>>>         /* use cache location of next-to-find key */
>>>         if (iter->pos > 0 && pos >= iter->pos) {
>>>                 pos -= iter->pos;
>>>                 key = iter->key;
>>> +               if (pos == 0) {
>>> +                       saved_pos = iter->pos;
>>> +                       key--;
>>> +               }
>>>         } else {
>>>                 iter->pos = 0;
>>>                 key = 0;
>>> @@ -2436,10 +2441,13 @@ static struct key_vector
>>> *fib_route_get_idx(struct fib_route_iter *iter,
>>>                         break;
>>>         }
>>>
>>> -       if (l)
>>> +       if (l) {
>>>                 iter->key = key;        /* remember it */
>>> -       else
>>> +               if (saved_pos)
>>> +                       iter->pos = saved_pos;
>>> +       } else {
>>>                 iter->pos = 0;          /* forget it */
>>> +       }
>>>
>>>         return l;
>>>  }
>>
>>
>> This doesn't seem correct to me.  I will have to look through this.
>> My understanding is that the value of iter->pos is supposed to be the
>> next position for us to grab, not the last one that was retrieved.  If
>> we are trying to re-request the last value then we should be falling
>> back into the else case for this since pos should be one less than
>> iter->pos.  The problem is the table could change out from under us
>> which is one of the reasons why we don't want to try and rewind the
>> key like you are doing here.
>>
>> - Alex
>>
>
> Hi Alex,
>
> In this case, seq_read() has called m->op->next(), which sets iter->pos
> equal to pos and iter->key to key + 1. However, when we then go to output
> the item associated with key, the 'm->op->next()' call overflows. Thus, we
> have a situation where iter->pos equals pos, iter->key = key + 1, but we
> have not displayed the item at position 'key' (thus the bug is that we miss
> the item at key).
>
> The change I proposed was simply to restart the search from 'key' in this
> case. If that item has disappeared, we will output the next one, or if its
> been replaced we will display its replacement. I think that is
> ok?
>
> The bug could also be fixed by changing:
>
> if (iter->pos > 0 && pos >= iter->pos) {
>
> to say:
>
> if (iter->pos > 0 && pos > iter->pos) {
>
> But that restarts the search on every overflow, which could mean every page
> size, and that seems suboptimal to me. Like-wise, if we make pos 1 less than
> iter->pos that restarts the search. The idea with this patch is to not force
> us to redo the entire search on each overflow.
>
> Thanks,
>
> -Jason


Actually I think the underlying issue is that we still have an
unresolved off by one error.  Specifically offset 0 actually
represents two values, the header for the file and entry 0.  I think
the fix is for us to look at pushing pos to 1 for the start of the
data, and we reserve POS 0 for SEQ_START_TOKEN. Doing that we should
start at the correct offset for each section following the first page.

- Alex

Re: [PATCH net] fib_trie: correct /proc/net/route for large read buffer

Reply via email to