DaWorm wrote:
On Sun, Sep 18, 2011 at 12:01 PM, Sven Barth
<pascaldra...@googlemail.com> wrote:
On 18.09.2011 17:48, DaWorm wrote:

But isn't it O(n^2) only when actually using unicode strings?

All MBCS encodings, with no fixed character size, suffer from that problem.
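To illustrate: a minimal sketch (not RTL code; the helper name is made up) of what an indexed character access means for UTF-8. Each access has to scan from the beginning of the string, so a naive loop over all characters ends up at O(n^2):

function Utf8BytePosOfChar(const S: UTF8String; CharIndex: Integer): Integer;
var
  BytePos, CharsSeen: Integer;
begin
  BytePos := 1;
  CharsSeen := 0;
  { each step skips one lead byte plus its continuation bytes (10xxxxxx) }
  while (BytePos <= Length(S)) and (CharsSeen < CharIndex - 1) do
  begin
    Inc(BytePos);
    while (BytePos <= Length(S)) and ((Ord(S[BytePos]) and $C0) = $80) do
      Inc(BytePos);
    Inc(CharsSeen);
  end;
  Result := BytePos;   { byte index where the requested character starts }
end;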

Wouldn't you also be able to do something like String.Encoding := Ansi,
so that every String[i] access becomes O(1) + x (where x is the
overhead of run-time checking that it is safe to just use a memory
offset, presumably fairly small)? Of course it would be up to the user
to choose to re-encode some string he got from the RTL or FCL that way
and to understand the consequences.

Calling subroutines for indexed access, instead of direct array access, will add another factor (10..100?) to single-character access, including register save/restore and disallowed optimizations.
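As a sketch only (GetCharAt is a hypothetical helper, not an existing RTL routine), this is the shape such code would take: every S[i] in a loop becomes a call instead of a plain load, with all the call overhead that implies:

function GetCharAt(const S: AnsiString; Index: Integer): AnsiChar;
begin
  { a real implementation would dispatch on the string's encoding here;
    for a single-byte encoding it degenerates to a plain load }
  Result := S[Index];
end;

function CountSpaces(const S: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if GetCharAt(S, i) = ' ' then   { one subroutine call per character }
      Inc(Result);
end;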

What assumptions is the typical String[i] user going to make about
what is returned?  There will be the types that check whether the
fifth character is a 'C' or something like that, and for those there
probably isn't too much that can go wrong; they might have to
switch to "C" instead, or the compiler can turn the 'C' literal into a
"unicode char which is really a string" conversion at compile time.
There may be the ones that want to turn a 'C' into a 'c' by flipping
the 6th bit, and that will indeed break; in a Unicode world,
perhaps that should break, forcing the use of LowerCase as needed.

That simple upper/lower bit-flip conversion works only for ASCII, not for ANSI chars.
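A small sketch of what the bit trick covers and what it doesn't (the function name is made up):

function AsciiToLower(C: AnsiChar): AnsiChar;
begin
  if C in ['A'..'Z'] then
    Result := Chr(Ord(C) or $20)   { 'C' ($43) -> 'c' ($63): set the $20 bit }
  else
    Result := C;                   { any byte outside A..Z is left alone;
                                     accented ANSI letters need LowerCase }
end;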

 And
there are those (such as myself) who often use strings as buffers for
things like serial comms.  That code will totally break if I were to
try to use a unicode string buffer, but a simple addition of
String.Encoding := ANSI or RawByteString or ShortString in the first
line would fix that, or I could bite the bullet and recode that quick
and dirty code the right way.

Delphi introduced TBytes for non-character byte data.
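A minimal sketch of that approach, with the raw bytes kept out of the string/encoding machinery and the conversion to text made an explicit, separate step (the byte values are just an example):

program TBytesDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  Buf: TBytes;
  Line: AnsiString;
begin
  { raw bytes as they might arrive from a serial port: 'O' 'K' CR LF }
  SetLength(Buf, 4);
  Buf[0] := $4F; Buf[1] := $4B; Buf[2] := $0D; Buf[3] := $0A;
  { only convert when the bytes really are text }
  SetString(Line, PAnsiChar(@Buf[0]), Length(Buf));
  Write(Line);
end.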

My point is that trying to keep the bad
habits of a single-byte string world in a Unicode world is
counterproductive.  They aren't the same, and all attempts to make
them the same just cause more problems than they solve.

That's why I still suggest using UTF-16 in user code. When the user simply skips over all chars he doesn't recognize, nothing can go wrong.
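A sketch of that rule, assuming "skip what you don't know" means stepping over surrogate pairs without interpreting them (CountSlashes is just an example name); since '/' can never be part of a surrogate pair, skipping them can't change the result:

function CountSlashes(const S: UnicodeString): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(S) do
    if (Ord(S[i]) >= $D800) and (Ord(S[i]) <= $DBFF) then
      Inc(i, 2)                     { high surrogate: skip the whole pair }
    else
    begin
      if S[i] = '/' then
        Inc(Result);
      Inc(i);
    end;
end;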

As for the RTL and FCL, presumably they wouldn't be doing any of this
String[i] stuff in the first place, would they? So they aren't going to
suffer that speed penalty.  Just because one type of code is slow
doesn't mean everything is slow.

It's absolutely safe, even with UTF-8 strings, to e.g. search for all '\' separators and to replace them in place with '/'. It's also safe to search for a set of (ASCII) separator chars and to split strings at these positions (e.g. CSV). Bytewise case-insensitive comparison also works with every encoding, at least for equality. Other comparisons are much slower, due to the required lookup of the sort-order values (alphabetic, dictionary etc.), again with every encoding. Even with ASCII there exists a choice of sorting 'a' like 'A', after 'A', or after 'Z'.
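The first of these, the in-place '\' -> '/' replacement, as a short sketch (SlashifyPath is just an example name). It is safe precisely because ASCII bytes never occur inside UTF-8 multi-byte sequences:

procedure SlashifyPath(var Path: UTF8String);
var
  i: Integer;
begin
  { only '\' bytes are touched, so no multi-byte character can be corrupted }
  for i := 1 to Length(Path) do
    if Path[i] = '\' then
      Path[i] := '/';
end;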

DoDi

