On Feb 2, 2009, at 7:50 PM, Joar Wingfors wrote:
On Feb 2, 2009, at 6:02 PM, Seth Willits wrote:
Before opening the file, either determine, guess, or be told what the encoding is. With that encoding, convert your delimiter string into raw bytes, then do a byte-for-byte comparison on the file's contents to find occurrences of that delimiter.
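
A minimal Python sketch of the approach described above, assuming the encoding is already known. The function name `find_delimiter_offsets` is hypothetical and used only for illustration. Note that Python's plain "utf-16" codec prepends a byte-order mark when encoding, so an explicit byte order such as "utf-16-le" is assumed for the delimiter.

```python
from typing import List


def find_delimiter_offsets(data: bytes, delimiter: str, encoding: str) -> List[int]:
    """Return the byte offsets at which the encoded delimiter occurs in data.

    Hypothetical helper: the caller is assumed to already know the encoding.
    """
    # Convert the delimiter string into raw bytes using the file's encoding.
    needle = delimiter.encode(encoding)
    if not needle:
        raise ValueError("delimiter must not be empty")

    offsets = []
    start = 0
    while True:
        # Plain byte-for-byte search; no decoding of the file contents.
        hit = data.find(needle, start)
        if hit == -1:
            return offsets
        offsets.append(hit)
        start = hit + len(needle)
```

For example, `find_delimiter_offsets(raw, "\r\n", "utf-16-le")` searches for the four bytes 0D 00 0A 00. Whether every raw match is a real delimiter depends on the encoding, which is the question raised later in this thread.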

How do you know what delimiter string to use? That's another thing you'd have to determine, guess, or be told, right? In general, I would guess that in this case it would almost always be impossible and/or inappropriate to try to determine either of the two, and that you would simply have to default to something reasonable.

That's right, though heuristics work better for the guess-line-ending problem than they do for the guess-encoding problem. If you scan the first few KB of a sufficiently long file and see exactly one kind of line ending, it's a good bet that the rest of the file uses that line ending too.
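
One way that heuristic could be sketched in Python, assuming an ASCII-compatible encoding such as UTF-8 in which CR and LF are the single bytes 0x0D and 0x0A. The function name `guess_line_ending` is hypothetical, and the sample is whatever the caller read from the start of the file (for example, the first few KB).

```python
from typing import Optional


def guess_line_ending(sample: bytes) -> Optional[str]:
    """Return the line ending if exactly one kind is seen in sample, else None.

    Hypothetical helper; assumes an ASCII-compatible encoding.
    """
    # A sample cut off between CR and LF would look like a lone CR,
    # so ignore a trailing CR.
    if sample.endswith(b"\r"):
        sample = sample[:-1]

    crlf = sample.count(b"\r\n")
    # Count only the CRs and LFs that are not part of a CRLF pair.
    lone_cr = sample.count(b"\r") - crlf
    lone_lf = sample.count(b"\n") - crlf

    counts = (("\r\n", crlf), ("\r", lone_cr), ("\n", lone_lf))
    seen = [ending for ending, count in counts if count > 0]

    # Commit to a guess only when exactly one kind of line ending appears.
    return seen[0] if len(seen) == 1 else None
```

When the function returns None (no line endings in the sample, or a mix of kinds), the caller would still have to fall back to a reasonable default.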


If you have an encoding where characters are not of fixed width, is it generally safe to assume that the byte signature of a valid delimiter string for that encoding cannot also be found as a subpattern of some combination of other characters? Perhaps that would always be a safe assumption; I'm no expert on string encodings and line delimiters.

Safe in some encodings, unsafe in others. I'm pretty sure that UTF-8 is safe: the bytes of one valid UTF-8 character never appear inside the encoded bytes of any other valid UTF-8 character.
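
A small Python illustration of the difference, using UTF-16-LE as an example of an encoding where a raw byte search can go wrong. The sample text and the helper `find_aligned` are hypothetical, chosen only to show the effect. The two characters U+0A05 and U+0100 contain no newline, yet their UTF-16-LE bytes contain the byte pattern of an encoded LF at an odd offset.

```python
text = "\u0a05\u0100"  # two non-ASCII characters, no newline anywhere

utf16 = text.encode("utf-16-le")  # bytes: 05 0A 00 01
utf8 = text.encode("utf-8")       # bytes: E0 A8 85 C4 80

lf16 = "\n".encode("utf-16-le")   # bytes: 0A 00
lf8 = "\n".encode("utf-8")        # bytes: 0A

# UTF-16-LE: the pattern 0A 00 straddles the two characters (false match).
false_hit = utf16.find(lf16)

# UTF-8: every byte of a multi-byte character is >= 0x80, so 0A cannot match.
utf8_hit = utf8.find(lf8)


def find_aligned(data: bytes, needle: bytes, unit: int) -> int:
    """Return the first match that starts on a code-unit boundary, or -1.

    Hypothetical helper: unit is the code-unit size in bytes (2 for UTF-16).
    """
    start = 0
    while True:
        hit = data.find(needle, start)
        if hit == -1 or hit % unit == 0:
            return hit
        start = hit + 1  # skip the misaligned match and keep searching
```

Here `false_hit` is 1 and `utf8_hit` is -1. For a fixed-size code unit such as UTF-16's, rejecting misaligned matches (as `find_aligned(utf16, lf16, 2)` does) is one way to avoid the false positive; other encodings may need different handling.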


--
Greg Parker     gpar...@apple.com     Runtime Wrangler

