if you knew flex you could understand

On Mon, Aug 18, 2008 at 1:55 PM, Ricky Sharp <[EMAIL PROTECTED]> wrote:
>
> On Aug 18, 2008, at 3:40 PM, mm w wrote:
>
>> to avoid the splitting problem
>>
>> (c < 128) ? "%c" : "\\u%04x", c);
>
> I'm not sure what this solves.
>
> Per Michael's e-mail below, this is indeed a difficult problem. UTF-8 is
> just a particular scheme to store Unicode strings. Operating on individual
> bytes in such streams will most likely not make any sense.
>
> What I would do is pick some normalized form and operate on that data. For
> a recent feature at my day job, we normalized all input CSV files to
> UTF-16BE. We were able to handle all of our customer data so far. The
> final solution still isn't 100% Unicode-savvy (e.g. it does crap-out with
> surrogate pairs), but we have unit tests to expose/document such
> limitations. And, customer data doesn't yet have such things.
>
>
>> On Sat, Aug 16, 2008 at 7:43 AM, Michael Ash <[EMAIL PROTECTED]>
>> wrote:
>>>
>>> - It's very difficult to split UTF-8 strings correctly. If you
>>> encounter a run of non-ASCII characters, ensure that you follow that
>>> run through the end, until you get back to ASCII. Don't have a regex
>>> that stops in the middle of it and then expects your code to be able
>>> to do something useful with it.
>>>
>
> ___________________________________________________________
> Ricky A. Sharp         mailto:[EMAIL PROTECTED]
> Instant Interactive(tm)   http://www.instantinteractive.com
>
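
To make the two points quoted above concrete, here is a rough sketch in C, not anyone's actual code from this thread. It assumes the input is already valid, NUL-terminated UTF-8. The helper names (utf8_split_point, utf8_decode, utf8_escape) are made up for illustration. The first helper backs a proposed cut offset up to the start of a code point so a split never lands mid-sequence; the last one applies the "%c" / "\\u%04x" choice per decoded code point rather than per byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Nonzero if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: largest offset <= want at which s[0..len) can be
   cut without landing inside a multi-byte sequence.
   Assumption: s holds valid UTF-8. */
static size_t utf8_split_point(const char *s, size_t len, size_t want)
{
    assert(s != NULL);
    if (want >= len)
        return len;
    while (want > 0 && is_continuation((unsigned char)s[want]))
        want--;
    return want;
}

/* Hypothetical helper: decode one code point starting at s, store it in
   *cp, and return the number of bytes consumed.
   Assumption: s points at the first byte of a valid UTF-8 sequence. */
static size_t utf8_decode(const char *s, unsigned long *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n, i;

    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { *cp = p[0] & 0x1F; n = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { *cp = p[0] & 0x0F; n = 3; }
    else { *cp = p[0] & 0x07; n = 4; }

    for (i = 1; i < n; i++)
        *cp = (*cp << 6) | (p[i] & 0x3F);
    return n;
}

/* Copy s into out, ASCII unchanged and everything else as a \u escape.
   Stops early if out is too small. Returns the number of characters
   written, not counting the terminating NUL. */
static size_t utf8_escape(const char *s, char *out, size_t outsz)
{
    size_t used = 0;

    assert(s != NULL && out != NULL && outsz > 0);
    out[0] = '\0';
    while (*s != '\0') {
        unsigned long c;
        int w;

        s += utf8_decode(s, &c);
        w = (c < 128)
            ? snprintf(out + used, outsz - used, "%c", (int)c)
            : snprintf(out + used, outsz - used, "\\u%04lx", c);
        if (w < 0 || (size_t)w >= outsz - used) {
            out[used] = '\0';   /* drop the partially written piece */
            break;
        }
        used += (size_t)w;
    }
    return used;
}
```

Two caveats. A code point boundary is not necessarily a safe place to split user-visible text (combining marks follow their base character), which is why Michael's advice of following the whole non-ASCII run is the more conservative rule. Also, "%04x" only pads to four hex digits, so code points above U+FFFF come out with five or six digits here instead of a surrogate pair.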
--
-mmw