> However, I still can't entirely shake the notion that we're overdoing it
> here. Maybe we could simply make the preprocessor and compiler grok
> UTF8 directly and get rid of the special casing. All compiler
> input processing would return back to 8-bit only.
Converting everything to UTF-8 before preprocessing would work, yes, provided it is then converted back to Unicode before tokenization. The alternative, handling UTF-8 in the tokenizer itself, would be needlessly messy. Define name/argument handling would be the only thing that needs to be altered in cpp to handle UTF-8.

Then again, just switching data[i] to IND(i) or similar, and having that defined as index_shared_string(data, i) (or, to break with the conventions in the code, not using a macro at all and calling the function directly), is actually significantly easier than adding UTF-8 support to the preprocessor. It is, however, bound to be somewhat slower in most cases. But I do not really think the difference matters at all, considering everything else we are doing in there.

-- Per Hedbor
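
For illustration, a minimal sketch of the macro change being described, assuming Pike's struct pike_string and its index_shared_string() function from stralloc.h (which reads one character out of a shared string regardless of whether it is stored 8, 16, or 32 bits wide). The IND() name follows the message; the scan() helper is purely hypothetical:

  #include <stddef.h>     /* ptrdiff_t */
  #include "stralloc.h"   /* struct pike_string, index_shared_string() */

  /* The macro expands at the point of use, so it picks up the local
   * "data" variable, mirroring how data[i] was written before. */
  #define IND(i) index_shared_string(data, (i))

  static void scan(struct pike_string *data, ptrdiff_t len)
  {
    ptrdiff_t i;
    for (i = 0; i < len; i++) {
      /* Before: data->str[i] == '#', valid only for 8-bit strings.
       * After: width-independent, one call per character access. */
      if (IND(i) == '#') {
        /* possible preprocessor directive */
      }
    }
  }

The per-character function call is the "somewhat slower" cost mentioned above, in exchange for not touching the 8-bit assumptions anywhere else in the preprocessor.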
