Re: [PATCH] SBuf c-string comparisons

Amos Jeffries Mon, 14 Apr 2014 03:50:25 -0700

On 14/04/2014 5:53 a.m., Alex Rousskov wrote:
> On 04/13/2014 05:13 AM, Amos Jeffries wrote:
>>>> +    /// comparison with a C-string
>>>> +    int compare(const char *s, SBufCaseSensitive isCaseSensitive, const 
>>>> size_type n) const;
>>> ...
>>>> +    const size_type byteCompareLen = min(n, length());
>>>> +    ++stats.compareSlow;
>>>> +    int rv = 0;
>>>> +    if (isCaseSensitive == caseSensitive) {
>>>> +        rv = memcmp(buf(), s, byteCompareLen);
>>>> +    } else {
>>>> +        rv = memcasecmp(buf(), s, byteCompareLen);
>>>> +    }
>>>
>>> The code above may access s bytes beyond s's 0-terminator (if any). The
>>> header file claims to provide a C-string [i.e., strncmp()] interface,
>>> but the .cc file implements something close to a memcmp() interface instead.
>>>
>>> The difference between memcmp and strcmp APIs is that the former stops
>>> at n while the latter also stops at the first terminator character
>>> (whichever comes first). The difference is not important for readable
>>> text, of course, but it becomes critical when dealing with "binary" data.
>>>
>>
>> memcmp() is documented at stopping at the byte when the strings differ.
> 
> Both memcmp and strncmp do that, of course. The difference is in where
> they stop when characters they looked at so far do _not_ differ, as
> summarized above.
> 
> 
>> The if-statement guaranteeing length() <= n, 
> 
> Yes.
> 
> 
>> will truncate this SBuf so min(n, length()) is never greater than n
> 
> Sure (by the definition of min).
> 
> 
>> and thus should never be strlen(s)+terminator.
> 
> This part (if I interpret it right) is incorrect. The "n" parameter in a
> strncmp API may exceed both length() and strlen(s) by an arbitrary
> amount. Consider the following for a c-string-based API:
> 
>   SBuf: 8192 bytes: "GET" followed by 8189 null characters.
>   s:       4 bytes: "GET" followed by a single null character.
>   n:   16384
> 
>   Expected SBuf.compare(s,, n) result: 0 (per c-string API **)
>   Actual   SBuf.compare(s,, n) result: random with occasional segfaults
> 
> The key here is that n does not tell strncmp() that s has at least n
> characters. It only tells strncmp() that it should not go beyond n-th
> character when performing a comparison. strncmp() must stop:
> 
>   * at the of end of the left c-string
>   * at the end of the right c-string
>   * at the first different character
>   * after n equal characters have been compared
> 
> whichever happens first. The proposed code implements a different API.
> 
> 
> 
>> NOTE that strn*cmp variants will produce ==0 results in the above broken
>> case whereas SBuf comparison will produce ==1 or ==-1 (input sizes differ).
> 
> Let's investigate that after strncmp variants are fixed so that we are
> not comparing broken code.
> 
> 
> 
>>>> +    // pretend we have a terminator and check against s.
>>>> +    return s[length()] == '\0' ? 0 : -1;
>>>
>>> Similar to the above, what if s[length()] does not exist? For example,
>>> in pseudo code:
>>>
>>>   SBuf buf("1234");
>>>   char s[3] = { '1', '2', '3', }; // s[3] does not exist
>>>   buf.cmp(s, 3);
>>>
>>> Please note that the above problem applies to both strCmp() and memCmp()
>>> flavors of the code.
>>
>> The SBuf truncation path is taken on buf. At which point it is
>> memcmp("123", s, 3) ==> 0.
> 
> You are right. I cannot find an example where s[length()] does not exist
> in the current code, sorry.
> 
> Please note that this is separate from the "C-string comparison must
> stop at the first null-terminator" problem discussed above. That problem
> is still open and may result in invalid accesses during memcmp() and
> memcasecmp() calls.
> 
> 
> 
> 
>>> Also, it is not clear (to me) what the above code tries to accomplish.
>>> The comment does not really clarify things [for me], especially since
>>> "we" usually refers to the "this" object but the expression does not
>>> check the "this" object buffer. If this code stays, please rephrase.
>>
>> C-string comparisons are including the terminator bytes for their final
>> EOS results **.
>>
>> This handles all the cases where:
>>  SBuf  left  = "foo"
>>  char* right = "foo"
>>  left.cmp(right, 4);
>>
>> In c-string comparisions these produce ==0. Using the X-longer-then-Y
>> logic pattern it would produce ==1 like "foo" vs "fooz".
>>
>> Since SBuf has no terminator we cannot compare the terminator byte like
>> c-string comparisons do. Accessing buf[length()]==s[length()] would be a
>> buffer overrun on buf. So we check '\0' == s[length()] instead.
>>
>> ie. we are guaranteed to be at the end of 'buf', check if 's' is the
>> c-string terminator.
> 
> Let's come back to this once the primary search code is fixed. I am sure
> we can find a way to clarify this code description if it stays.


I've added a unit test to catch the rare \0 mid-string case I spoke of:
 * SBuf("foo\0test").compare("foo", 9);

It works fine up to the point N(4) > strlen("foo"). After that point our
function returns 1 to indicate that the SBuf is a longer string, whereas
strncmp/strncasecmp return 0 up to infinity.

The code as presented earlier copes with the weirdness fine.

NOTE: I can swap memcmp/strncmp and memcasecmp/strncasecmp functions
interchangeably with the same test results. So I'm fine with using any
of them as the backing scan.

Amos

Re: [PATCH] SBuf c-string comparisons

Reply via email to