Re: 3rd Party Nonsense (was Re: Regular Expressions?)

Michael Ash Mon, 09 Jun 2008 22:38:56 -0700

On Mon, Jun 9, 2008 at 8:17 PM, Jens Alfke <[EMAIL PROTECTED]> wrote:
>
> On 8 Jun '08, at 3:39 AM, Michael Ash wrote:
>
>> I do this with a fair amount of regularity. NSString is unsuitable for
>> working with data whose encoding is unknown or doubtful, and NSData
>> doesn't have any string-like functionality, so the standard C str
>> functions can be very useful here.
>
> Ouch. The problem with those is that, every time you call one, you've added
> a potential buffer overrun bug to your app. And if the data in the string
> came from an untrusted source like the network, that escalates to a
> potential security vulnerability.


Sorry, what? It's perfectly possible to write safe code that calls C
str functions. My code is no more vulnerable than the next man's. You
can call things like strnstr, pass the length of the NSData you're
working on, and the search can never read past the end of the buffer.
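
To illustrate the length-bounded approach, here is a minimal sketch in
plain C. strnstr is a BSD function and is not available everywhere, so
this sketch swaps in memchr and memcmp, which take explicit lengths and
do not stop at NUL bytes. The name bounded_find is hypothetical and not
part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find the first occurrence of needle (needle_len
   bytes) within the first hay_len bytes of hay. It never reads past
   hay + hay_len and needs neither buffer to be NUL-terminated. */
static const char *bounded_find(const char *hay, size_t hay_len,
                                const char *needle, size_t needle_len)
{
    assert(hay != NULL && needle != NULL);
    if (needle_len == 0)
        return hay;
    if (needle_len > hay_len)
        return NULL;

    const char *p = hay;
    /* Number of positions where a full match could still start. */
    size_t remaining = hay_len - needle_len + 1;
    while (remaining > 0) {
        /* Look for the first byte of needle among the valid starts. */
        const char *hit = memchr(p, needle[0], remaining);
        if (hit == NULL)
            return NULL;
        /* hit + needle_len is guaranteed to stay inside the buffer. */
        if (memcmp(hit, needle, needle_len) == 0)
            return hit;
        remaining -= (size_t)(hit - p) + 1;
        p = hit + 1;
    }
    return NULL;
}
```

With an NSData, the haystack arguments would presumably be
[data bytes] and [data length], so the bound always comes from the
object that owns the buffer rather than from the untrusted content.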

> Also, speaking of doubtful encodings, the regular C string functions will
> fail quite badly on 16-bit character encodings, where it's more than likely
> that every other byte is a zero.

While true, this is also irrelevant when you know that your data is
not, in fact, 16-bit. I use this technique when the data is known to
be ASCII-like but exactly what kind of ASCII-like encoding is unknown.
It would be nonsensical to use it for UTF-16 data, and thus I don't.

> My general tactic when dealing with unknown data whose encoding can't be
> determined is to just fall back on CP-1252 [though Aki Inoue suggested
> MacRoman], both of which are supersets of ascii that map every byte to a
> character. That way you'll always get a non-nil NSString, and any ascii text
> in the original will come out unscathed. That's a better result than you'll
> get with C string APIs.

No, it's not. A common technique is to use C string APIs to find line
endings, then try to decode each full line as UTF-8. If that fails,
you can fall back on a more forgiving encoding. This gives correct
results for UTF-8, which in many cases is the expected encoding, and
turning well-formed UTF-8 text into long strings of nonsense
characters is generally undesirable. The UTF-8 check also has an
extremely low probability of false positives, as it's difficult to
construct a sensible string in a different encoding which is also
valid UTF-8. The fallback guarantees that you can at least try
something if you get data that you don't expect.
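
A rough sketch of that per-line technique in plain C follows. It only
decides which encoding to try for each line; the actual decoding is
left out. All names here (is_valid_utf8, next_line, classify_line,
line_encoding) are hypothetical, and in Cocoa the validity check could
instead be done by asking NSString to decode the line as
NSUTF8StringEncoding and testing the result for nil.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { LINE_UTF8, LINE_FALLBACK } line_encoding;

/* True if the len bytes at s are well-formed UTF-8. Rejects overlong
   forms, surrogates, and code points above U+10FFFF. */
static bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;           /* continuation bytes expected */
        unsigned long cp, cp_min;
        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; cp_min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; cp_min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; cp_min = 0x10000; }
        else return false;      /* stray continuation or invalid lead byte */
        if (len - i - 1 < extra)
            return false;       /* sequence truncated at end of line */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < cp_min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

/* Splits the next '\n'-terminated line off [*cursor, end). Returns
   false once the input is exhausted. Uses memchr, so the search is
   bounded by the buffer length and embedded NULs are harmless. */
static bool next_line(const char **cursor, const char *end,
                      const char **line, size_t *line_len)
{
    assert(cursor != NULL && line != NULL && line_len != NULL);
    const char *p = *cursor;
    if (p >= end)
        return false;
    const char *nl = memchr(p, '\n', (size_t)(end - p));
    *line = p;
    *line_len = nl ? (size_t)(nl - p) : (size_t)(end - p);
    *cursor = nl ? nl + 1 : end;
    return true;
}

/* Try UTF-8 first; anything that fails goes to the forgiving fallback
   (for example CP-1252 or MacRoman, as discussed above). */
static line_encoding classify_line(const char *line, size_t len)
{
    return is_valid_utf8((const unsigned char *)line, len)
        ? LINE_UTF8 : LINE_FALLBACK;
}
```

Because the decision is made per line, one badly encoded line does not
force the rest of an otherwise well-formed UTF-8 file through the
fallback encoding.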

You may not be familiar with this technique but that doesn't mean it's
bad. It's good and useful in many situations.

Mike