Hi,
adapting the current implementation to the "extra optional
parameter" was rather trivial; please consider the attached patch,
whose bulk is unit tests and (certainly improvable) documentation.
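To make the new parameter concrete, here is a minimal usage sketch
(assuming a CharacterSet named "whitespace" that matches SP, as in the
attached unit tests):

    Parser::Tokenizer tok(SBuf("GET /path HT"));
    SBuf method, uri, partial;
    tok.token(method, whitespace);        // true: "GET" (delimiter follows)
    tok.token(uri, whitespace, true);     // true: "/path" (delimiter follows)
    tok.token(partial, whitespace, true); // false: "HT" reaches end of buffer
    tok.token(partial, whitespace);       // true: "HT" (end of buffer is an
                                          //       implicit delimiter by default)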
On Mon, Jun 2, 2014 at 8:07 PM, Alex Rousskov
<[email protected]> wrote:
> On 06/02/2014 11:27 AM, Kinkie wrote:
>>>> + //fixme: account for buf_.size()
>>>> + bool neg = false;
>>>> + const char *s = buf_.rawContent();
>>>> + const char *end = buf_.rawContent() + buf_.length();
>>>
>>> Could you please explain what the above comment means? There is no
>>> SBuf::size() AFAICT.
>>
>> length() is an alias of size(). Changed to size() to avoid confusion.
>
> The above code already accounts for .length() or .size(), right? What is
> there to fix? Is it a bug (XXX) or just a future improvement note (TODO)?
>
>
>>>> + * At least one terminating delimiter is required. \0 may be passed
>>>> + * as a delimiter to treat end of buffer content as the end of token.
>>>
>>>> + if (tokenLen == SBuf::npos && !delimiters['\0']) {
>>>
>>> The above is from the actually committed code (not the code posted for
>>> review). Please remove \0 as a special case. As far as I can tell, the
>>> reason it was added is already covered well enough by the existing API.
>>> The addition of this special case is likely to just confuse and break
>>> things. For example, adding \0 to delimiters for the purpose of
>>> disabling this special case will also tell Tokenizer to skip \0
>>> characters in input, which is often wrong.
>>
>> I'll let Amos argue for this as he's the one who has the clearest
>> picture of the use case.
>> Given your objection, my 2c suggestion to address the need is to
>> either have two token() methods, one which requires a separator at the
>> end of a token and one which doesn't, or to have a bool flag passed to
>> the method.
>
> I hope neither is needed, but I will wait for the "use case" discussion.
>
> BTW, if possible, please work on your proposal to avoid these kinds of
> rushed commits of knowingly contested code.
>
>
> Thank you,
>
> Alex.
>
--
Francesco
=== modified file 'src/parser/Tokenizer.cc'
--- src/parser/Tokenizer.cc 2014-06-02 17:52:31 +0000
+++ src/parser/Tokenizer.cc 2014-06-02 18:59:23 +0000
@@ -29,12 +29,12 @@
#endif
bool
-Parser::Tokenizer::token(SBuf &returnedToken, const CharacterSet &delimiters)
+Parser::Tokenizer::token(SBuf &returnedToken, const CharacterSet &delimiters, const bool requireComplete)
{
const SBuf savebuf(buf_);
skip(delimiters);
const SBuf::size_type tokenLen = buf_.findFirstOf(delimiters); // not found = npos => consume to end
-    if (tokenLen == SBuf::npos && !delimiters['\0']) {
-        // no delimiter found, nor is NUL/EOS/npos acceptible as one
+    if (tokenLen == SBuf::npos && requireComplete) {
+        // no delimiter found, and the caller requires a complete token
buf_ = savebuf;
return false;
=== modified file 'src/parser/Tokenizer.h'
--- src/parser/Tokenizer.h 2014-06-02 00:15:10 +0000
+++ src/parser/Tokenizer.h 2014-06-02 18:59:24 +0000
@@ -42,12 +42,16 @@
*
* Want to extract delimiters? Use prefix() instead.
*
- * At least one terminating delimiter is required. \0 may be passed
- * as a delimiter to treat end of buffer content as the end of token.
- *
+ * \param returnedToken output parameter; set to the extracted token if
+ *        tokenization succeeds, left untouched otherwise
+ * \param delimiters the set of delimiter characters, optionally found
+ *        before the token and terminating it
+ * \param requireComplete if true, a delimiter must be found after the
+ *        token for parsing to succeed; if false (the default), the end
+ *        of the parse buffer is treated as an implicit delimiter
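+ *
+ *        For example, parsing "two words" with space as the only delimiter:
+ *        token(t, delims, true) extracts "two" but then fails on "words",
+ *        while token(t, delims) extracts "two" and then "words".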
* \return false if no terminal delimiter is found.
*/
- bool token(SBuf &returnedToken, const CharacterSet &delimiters);
+ bool token(SBuf &returnedToken, const CharacterSet &delimiters, const bool requireComplete = false);
/** Accumulates all sequential permitted characters up to an optional length limit.
*
=== modified file 'src/parser/testTokenizer.cc'
--- src/parser/testTokenizer.cc 2014-06-02 00:15:10 +0000
+++ src/parser/testTokenizer.cc 2014-06-02 19:28:27 +0000
@@ -82,19 +82,45 @@
void
testTokenizer::testTokenizerToken()
{
- Parser::Tokenizer t(text);
- SBuf s;
-
- // first scenario: patterns match
- CPPUNIT_ASSERT(t.token(s,whitespace));
- CPPUNIT_ASSERT_EQUAL(SBuf("GET"),s);
- CPPUNIT_ASSERT(t.token(s,whitespace));
- CPPUNIT_ASSERT_EQUAL(SBuf("http://resource.com/path"),s);
- CPPUNIT_ASSERT(t.token(s,whitespace));
- CPPUNIT_ASSERT_EQUAL(SBuf("HTTP/1.1"),s);
- CPPUNIT_ASSERT(t.token(s,whitespace));
- CPPUNIT_ASSERT_EQUAL(SBuf("Host:"),s);
-
+    {
+        // first scenario: patterns match
+        Parser::Tokenizer t(text);
+        SBuf s;
+
+ CPPUNIT_ASSERT(t.token(s,whitespace));
+ CPPUNIT_ASSERT_EQUAL(SBuf("GET"),s);
+ CPPUNIT_ASSERT(t.token(s,whitespace));
+ CPPUNIT_ASSERT_EQUAL(SBuf("http://resource.com/path"),s);
+ CPPUNIT_ASSERT(t.token(s,whitespace));
+ CPPUNIT_ASSERT_EQUAL(SBuf("HTTP/1.1"),s);
+ CPPUNIT_ASSERT(t.token(s,whitespace));
+ CPPUNIT_ASSERT_EQUAL(SBuf("Host:"),s);
+ }
+ {
+        // second scenario: complete vs. incomplete tokens (requireComplete)
+ SBuf token;
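+        // token is empty here; it merely seeds the Tokenizer, while each
+        // reset() below replaces the parse buffer with fresh input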
+ Parser::Tokenizer t(token);
+
+ t.reset(SBuf(" foo "));
+ CPPUNIT_ASSERT(t.token(token,whitespace));
+ CPPUNIT_ASSERT_EQUAL(SBuf("foo"),token);
+ token.clear();
+
+ t.reset(SBuf(" foo "));
+ CPPUNIT_ASSERT(t.token(token,whitespace,true));
+ CPPUNIT_ASSERT_EQUAL(SBuf("foo"),token);
+ token.clear();
+
+ t.reset(SBuf(" foo"));
+ CPPUNIT_ASSERT(t.token(token,whitespace));
+ CPPUNIT_ASSERT_EQUAL(SBuf("foo"),token);
+ token.clear();
+
+ t.reset(SBuf(" foo"));
+ CPPUNIT_ASSERT(!t.token(token,whitespace,true));
+ CPPUNIT_ASSERT_EQUAL(SBuf(""),token);
+ }
}
void