If you knew flex, you would understand.

On Mon, Aug 18, 2008 at 1:55 PM, Ricky Sharp <[EMAIL PROTECTED]> wrote:
>
> On Aug 18, 2008, at 3:40 PM, mm w wrote:
>
>> to avoid the splitting problem
>>
>> (c < 128) ? "%c" : "\\u%04x", c);
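
A minimal sketch of how that fragment could be read, assuming `c` is an already-decoded code point and not a raw UTF-8 byte. The helper name `escape_code_point` and its buffer-based signature are hypothetical and do not come from the thread:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not from the thread): writes one code point to `out`
 * either as a plain ASCII character or as a \uXXXX escape.
 * Assumes `c` is a decoded code point, not a single UTF-8 byte.
 * Returns the number of characters written (excluding the terminating NUL),
 * or -1 if the buffer is too small.
 * Note: %04x is a minimum width, so code points above U+FFFF print with
 * five or six hex digits. */
int escape_code_point(uint32_t c, char *out, size_t out_size)
{
    assert(out != NULL);
    int n = (c < 128)
        ? snprintf(out, out_size, "%c", (int)c)
        : snprintf(out, out_size, "\\u%04x", (unsigned)c);
    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}
```

With this, 'A' comes out as "A" and U+00E9 comes out as the six ASCII characters "\u00e9", so the output contains nothing but ASCII and can be split at any byte.
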
>
> I'm not sure what this solves.
>
> Per Michael's e-mail below, this is indeed a difficult problem.  UTF-8 is
> just a particular scheme to store Unicode strings.  Operating on individual
> bytes in such streams will most likely not make any sense.
>
> What I would do is pick some normalized form and operate on that data.  For
> a recent feature at my day job, we normalized all input CSV files to
> UTF-16BE.  We were able to handle all of our customer data so far.  The
> final solution still isn't 100% Unicode-savvy (e.g. it craps out on
> surrogate pairs), but we have unit tests to expose/document such
> limitations. And customer data doesn't yet contain such things.
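
To illustrate the surrogate-pair limitation mentioned above, here is a rough sketch of decoding one code point from UTF-16. It assumes the code units have already been converted from big-endian to host byte order; the name `utf16_decode` is hypothetical and says nothing about how the code described above actually works:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from UTF-16 code units that
 * are already in host byte order.
 * Returns the number of code units consumed (1 or 2), or 0 if the input is
 * empty or starts with an unpaired surrogate. */
size_t utf16_decode(const uint16_t *units, size_t len, uint32_t *cp)
{
    assert(units != NULL && cp != NULL);
    if (len == 0)
        return 0;

    uint16_t hi = units[0];
    if (hi < 0xD800 || hi > 0xDFFF) {
        *cp = hi;               /* ordinary BMP code unit */
        return 1;
    }
    if (hi >= 0xDC00 || len < 2)
        return 0;               /* lone low surrogate, or truncated pair */

    uint16_t lo = units[1];
    if (lo < 0xDC00 || lo > 0xDFFF)
        return 0;               /* high surrogate not followed by a low one */

    /* Combine the two 10-bit halves and add the 0x10000 offset. */
    *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    return 2;
}
```

Code that treats every 16-bit unit as one character works for the BMP but goes wrong as soon as a pair such as D83D DE00 (U+1F600) shows up, which is the kind of case the unit tests would document.
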
>
>
>> On Sat, Aug 16, 2008 at 7:43 AM, Michael Ash <[EMAIL PROTECTED]>
>> wrote:
>>>
>>> - It's very difficult to split UTF-8 strings correctly. If you
>>> encounter a run of non-ASCII characters, ensure that you follow that
>>> run through the end, until you get back to ASCII. Don't have a regex
>>> that stops in the middle of it and then expects your code to be able
>>> to do something useful with it.
>>>
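
The advice above (follow a non-ASCII run through to its end) could be sketched roughly as follows. The function name `utf8_safe_split` is hypothetical, and the sketch assumes the buffer holds valid UTF-8, where every byte of a multi-byte sequence is 0x80 or higher:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given a proposed split offset `pos` into a UTF-8
 * buffer, returns an offset that does not fall inside a run of non-ASCII
 * bytes. If `pos` is in the middle of such a run, it is moved forward to
 * the first ASCII byte after the run (or to the end of the buffer). */
size_t utf8_safe_split(const unsigned char *s, size_t len, size_t pos)
{
    assert(s != NULL);
    if (pos >= len)
        return len;

    /* Only move if the byte just before `pos` is part of a non-ASCII run. */
    if (pos > 0 && s[pos - 1] >= 0x80) {
        while (pos < len && s[pos] >= 0x80)
            pos++;
    }
    return pos;
}
```

This is deliberately conservative: it may skip past a valid code point boundary inside the run, but it never hands the caller half of a multi-byte sequence.
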
>
> ___________________________________________________________
> Ricky A. Sharp         mailto:[EMAIL PROTECTED]
> Instant Interactive(tm)   http://www.instantinteractive.com

-- 
-mmw
_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
