I have a string *str* of Unicode characters. How can I check whether a valid Unicode code point starts at code unit *index*?

I need this because I am trying to write a parser that operates on a string by *code unit*. More precisely, I am trying to write a function *matchWord* that extracts a whole word (which may contain more than just English letters) from the text and compares it with the word passed as a parameter. I would like to avoid decoding when it is not necessary. But it looks like I can't do it without decoding, because I need to know whether the current character is a letter and not, for example, punctuation or whitespace.

Here is roughly what I think it could look like. In the real code I have a template algorithm that operates on different string types: string, wstring, dstring.

struct Lexer
{
        string str;
        size_t index;

        bool matchWord(string word)
        {
                size_t i = index;
                while( !str[i..$].empty )
                {
                        if( !str.isValidChar(i) )
                        {
                                i++;
                                continue;
                        }
                        
                        uint len = str.graphemeStride(i);

                        if( !isAlpha(str[i..i+len]) )
                        {
                                break;
                        }
                        i++;
                }
                
                return word == str[index..i];
        }
}

This is just a draft of the idea, and maybe it is overcomplicated. What I want as a result is a boolean flag (matched or not), and the position should be moved past the word if it matched. And it should match whole words only, of course.

How do I implement this correctly, without overhead and without extra UTF decoding if possible?

Also, how can I validate a single character of a string starting at a given code unit index? I also don't like that graphemeStride can throw an Exception if I point to the wrong position. Is there a nothrow version? I don't want extra allocations for exceptions.
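
To make the validation question concrete, here is a rough sketch of the kind of check I mean. It handles UTF-8 (string) only, and the name validSequenceLength is made up for this post; it is not a Phobos function. It returns the number of code units in the well-formed sequence starting at the given index, or 0 if no valid code point starts there, and it neither throws nor allocates. I am not sure this is the best approach, it is only meant to show the behaviour I am after.

```d
// Hypothetical helper (not part of Phobos), UTF-8 only.
// Returns the length in code units of the well-formed UTF-8 sequence
// that starts at str[i], or 0 if no valid code point starts there.
size_t validSequenceLength(const(char)[] str, size_t i) pure nothrow @nogc @safe
{
    if (i >= str.length)
        return 0;

    immutable uint b0 = str[i];
    if (b0 < 0x80)
        return 1; // single-byte (ASCII) code point

    size_t len;
    uint cp;
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else
        return 0; // continuation byte or invalid lead byte

    if (str.length - i < len)
        return 0; // sequence is truncated

    foreach (k; 1 .. len)
    {
        immutable uint b = str[i + k];
        if ((b & 0xC0) != 0x80)
            return 0; // expected a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }

    // Smallest code point that may legally use a sequence of each length;
    // anything below it is an overlong encoding.
    static immutable uint[5] minCp = [0, 0, 0x80, 0x800, 0x1_0000];
    if (cp < minCp[len])
        return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0; // surrogates are not valid in UTF-8
    if (cp > 0x10_FFFF)
        return 0; // beyond the Unicode range

    return len;
}
```

With something like this, the loop in matchWord could advance i by the returned length instead of by one code unit, and skip a single code unit when the result is 0. It still does not tell me whether the code point is a letter, which is the part where I think decoding is unavoidable.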
