I don't think full UTF-16 would really be desirable over UCS-2.
Imagine you have a string of some million characters (e.g. a Book). All
functions that need to find the n-th character (like x[n], copy, ...)
would take forever, as they need to scan the complete string (if not
The encoding can be important for speed:
For example the widestring xml parser is up to 10 times slower than
the ansistring xml parser.
That obviously is the reason why Turbo Delphi uses UCS-2 (16 bit)
instead of UTF-8 or UTF-16 for WideStrings (and WideChar is a 16 bit
(UCS-2)
s[i]:='x' doesn't work in UTF-8, nor UTF-16, nor UTF-32.
It would work, but it would need an implementation that moves the tail
of the string around and thus would be really slow.
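
To make that cost concrete, here is a minimal sketch of what an in-place s[i] := 'x' implies on a UTF-8 buffer. It is written in Python rather than Pascal so it can be run as-is, and it is not taken from any FPC or Delphi implementation; the helper names char_start and set_char_utf8 are hypothetical. When the new character encodes to a different number of bytes than the old one, the tail of the buffer has to move.

```python
def char_start(buf: bytearray, index: int) -> int:
    """Return the byte offset of the index-th code point (0-based) by scanning."""
    seen = 0
    for pos in range(len(buf)):
        # Bytes of the form 10xxxxxx are continuation bytes, not character starts.
        if buf[pos] & 0xC0 != 0x80:
            if seen == index:
                return pos
            seen += 1
    raise IndexError(index)


def set_char_utf8(buf: bytearray, index: int, ch: str) -> int:
    """Replace one code point in place; return how many bytes the tail moved."""
    start = char_start(buf, index)
    end = start + 1
    while end < len(buf) and buf[end] & 0xC0 == 0x80:
        end += 1
    new = ch.encode("utf-8")
    # Slice assignment shifts everything after `end` when the sizes differ.
    buf[start:end] = new
    return len(new) - (end - start)


text = bytearray("na\u00efve".encode("utf-8"))  # the i-diaeresis takes 2 bytes
shift = set_char_utf8(text, 2, "x")
print(text.decode("utf-8"), shift)
```

Both the scan in char_start and the tail move are linear in the string length, which is the slowdown described above.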
-Michael
___
fpc-devel maillist -
On Monday, 29 September 2008 09:25, Michael Schnell wrote:
The encoding can be important for speed:
For example the widestring xml parser is up to 10 times slower than
the ansistring xml parser.
That obviously is the reason why Turbo Delphi uses UCS-2 (16 bit)
instead of UTF-8 or
That obviously is the reason why Turbo Delphi uses UCS-2 (16 bit)
instead of UTF-8 or UTF-16 for WideStrings (and WideChar is a 16
bit (UCS-2) value).
You didn't read http://www.jacobthurman.com/?p=30 , did you?
They are talking about Delphi 2009, of which I don't have any
Michael Schnell wrote:
The encoding can be important for speed:
For example the widestring xml parser is up to 10 times slower than
the ansistring xml parser.
That obviously is the reason why Turbo Delphi uses UCS-2 (16 bit)
instead of UTF-8 or UTF-16 for WideStrings (and WideChar
are you sure they are using UCS2 and not some 16bit codepages? That
exists also ;)
Not really.
I checked the Unicode code points 0x0100 and 0x0101 (capital and lower-case
A with a macron). Both are displayed correctly in the debugger when pointing
to the WideString variable. So I guess it indeed is
On Sunday 28 September 2008 00.10:43 Graeme Geldenhuys wrote:
On Fri, Sep 26, 2008 at 5:02 PM, Mattias Gaertner
[EMAIL PROTECTED] wrote:
s[i]:='x' doesn't work in UTF-8, nor UTF-16, nor UTF-32.
In short:
A single character for all purposes can not be defined. Unicode can not
be
On Sun, 28 Sep 2008 09:23:14 +0200
Martin Schreiber [EMAIL PROTECTED] wrote:
On Sunday 28 September 2008 00.10:43 Graeme Geldenhuys wrote:
On Fri, Sep 26, 2008 at 5:02 PM, Mattias Gaertner
[EMAIL PROTECTED] wrote:
s[i]:='x' doesn't work in UTF-8, nor UTF-16, nor UTF-32.
In short:
On Sun, Sep 28, 2008 at 12:22 PM, Mattias Gaertner
[EMAIL PROTECTED] wrote:
You cannot normalize the composed and decomposed state platform
independently. For example, Linux ext3 does not normalize in any
way and therefore distinguishes between composed a-umlaut and decomposed
a-umlaut. You can
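
The composed/decomposed distinction can be seen directly. The sketch below uses Python's standard unicodedata module purely as an illustration; it says nothing about how ext3, MSEgui, or Lazarus actually treat file names.

```python
import unicodedata

composed = "\u00e4"      # a-umlaut as a single code point
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS

# The two spellings look the same on screen but are different byte strings,
# so a file system that does not normalize sees two different names.
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")

# Normalizing both sides to one form makes them compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed

print(same_bytes, nfc_equal, nfd_equal)
```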
On Sunday 28 September 2008 20.16:36 Graeme Geldenhuys wrote:
On Sun, Sep 28, 2008 at 12:22 PM, Mattias Gaertner
Is this normalized form used only internally in msegui or must the user
use them too?
I remember when I tried a MSEgui version some time back, that the IDE
itself used that
Graeme Geldenhuys wrote:
(AFAI understand, a Widechar is just 16 bit, it would need to
be 32 bit if surrogates were allowed in Widestrings).
Good question and I have been wondering about this myself. In D2009
SizeOf(Char) = 2, so I have no idea how that works with surrogate
pairs. Can
On Sat, Sep 27, 2008 at 2:35 PM, Luiz Americo Pereira Camara
[EMAIL PROTECTED] wrote:
Good question and I have been wondering about this myself. In D2009
SizeOf(Char) = 2, so I have no idea how that works with surrogate
pairs. Can anybody explain this please?
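
How Delphi 2009 handles this internally is not established in this thread. The sketch below only shows the standard UTF-16 arithmetic by which one code point above U+FFFF is stored as two 16-bit units, which is why a 2-byte Char can hold half of such a character but never the whole. It is written in Python so it can be run as-is; the function names are hypothetical.

```python
def to_surrogates(cp: int):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000                      # 20 bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(hi: int, lo: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


cp = 0x1D11E                              # MUSICAL SYMBOL G CLEF
hi, lo = to_surrogates(cp)
units = len(chr(cp).encode("utf-16-le")) // 2   # number of 16-bit units
print(hex(hi), hex(lo), units)
```

A string type that counts 16-bit units would report a length of 2 for this single character, so indexing by unit and indexing by character diverge as soon as such code points appear.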
In
On Thu, Sep 25, 2008 at 10:33 PM, Florian Klaempfl
[EMAIL PROTECTED] wrote:
Who says that? UTF-16 is simply chosen because it has features (supporting
all characters basically) ANSI doesn't?
Sorry, my message was unclear and I got somewhat mixed up between ANSI
and UTF-8. I meant the encoding
On Fri, 26 Sep 2008, Graeme Geldenhuys wrote:
On Thu, Sep 25, 2008 at 10:33 PM, Florian Klaempfl
[EMAIL PROTECTED] wrote:
Who says that? UTF-16 is simply chosen because it has features (supporting
all characters basically) ANSI doesn't?
Sorry, my message was unclear and I got somewhat
Graeme Geldenhuys wrote:
On Thu, Sep 25, 2008 at 10:33 PM, Florian Klaempfl
I suppose it would be viable doing timing results for saving text
files as well. After all, 99% of the time, text files are stored in
UTF-8.
Where did you get that number (99%) from? I don't think that is true,
On Fri, Sep 26, 2008 at 9:04 AM, Graeme Geldenhuys
[EMAIL PROTECTED] wrote:
So has anybody actually done a timing comparison? Do you have your
test code available? Do you have your results published? I'm
interested to see the timing results using different hardware.
What I'm getting at, is
Graeme Geldenhuys wrote:
On Thu, Sep 25, 2008 at 10:33 PM, Florian Klaempfl
[EMAIL PROTECTED] wrote:
Who says that? UTF-16 is simply chosen because it has features (supporting
all characters basically) ANSI doesn't?
Sorry, my message was unclear and I got somewhat mixed up between ANSI
On Fri, Sep 26, 2008 at 09:04, Graeme Geldenhuys
[EMAIL PROTECTED] wrote:
On Thu, Sep 25, 2008 at 10:33 PM, Florian Klaempfl
[EMAIL PROTECTED] wrote:
I suppose it would be viable doing timing results for saving text
files as well. After all, 99% of the time, text files are stored in
UTF-8.
On Fri, Sep 26, 2008 at 9:12 AM, Daniël Mantione
[EMAIL PROTECTED] wrote:
For me the speed of input/output is less relevant, this is limited by disk
speed anyway. It's the speed of processing that should be decisive.
That's highly dependent on what your application does! If your
application
Graeme Geldenhuys wrote:
On Fri, Sep 26, 2008 at 9:04 AM, Graeme Geldenhuys
[EMAIL PROTECTED] wrote:
So has anybody actually done a timing comparison? Do you have your
test code available? Do you have your results published? I'm
interested to see the timing results using different
On Fri, Sep 26, 2008 at 9:19 AM, Aleksa Todorovic [EMAIL PROTECTED] wrote:
I support the decision to use UTF-16 over UTF-8. String processing is
far simpler, it's actually as simple as it should be.
And that's guaranteed to work with surrogate pairs as well? The
problem is, most people
Graeme Geldenhuys wrote:
On Fri, Sep 26, 2008 at 9:27 AM, Florian Klaempfl
[EMAIL PROTECTED] wrote:
Being honest, imo UTF-8 is only a hack to get unicode on platforms like
unix.
I don't know where you get that information,
Rather simple: initially in unicode 1.0 there was only a 16 bit
In our previous episode, Graeme Geldenhuys said:
Yes I know we have had lengthy discussions about this before.
Everybody (whoever they might be) keeps saying that UTF-16 was chosen
for Tiburon's UnicodeString because it makes significant speed gains
when calling the Windows API based on UTF-16
In our previous episode, Aleksa Todorovic said:
I suppose it would be viable doing timing results for saving text
files as well. After all, 99% of the time, text files are stored in
UTF-8. So in D2009 you would first have to convert UTF-16 to UTF-8 and
then save. And the opposite when
In our previous episode, Florian Klaempfl said:
On Fri, Sep 26, 2008 at 9:27 AM, Florian Klaempfl
[EMAIL PROTECTED] wrote:
Being honest, imo UTF-8 is only a hack to get unicode on platforms like
unix.
I don't know where you get that information,
Rather simple: initially in unicode
Well if you have Utf-8 versions of all basic string processing
functions like Pos, Length, Copy, Insert etc
s[i] := 'x'; will be especially funny :).
-Michael
It's no different than UTF-16 if you want to do it properly. In both you
have to look out for surrogates.
Is UTF-16 Widestring in FPC (and Delphi 200x ? ) not done just ignoring
the surrogates ? (AFAI understand, a Widechar is just 16 bit, it would
need to be 32 bit if surrogates were
On Fri, Sep 26, 2008 at 10:43 AM, Michael Schnell [EMAIL PROTECTED] wrote:
It's no different than UTF-16 if you want to do it properly. In both you
have to look out for surrogates.
Is UTF-16 Widestring in FPC (and Delphi 200x ? ) not done just ignoring the
surrogates ?
Let's hope not,
On Fri, 26 Sep 2008, Graeme Geldenhuys wrote:
On Fri, Sep 26, 2008 at 9:12 AM, Daniël Mantione
[EMAIL PROTECTED] wrote:
For me the speed of input/output is less relevant, this is limited by disk
speed anyway. It's the speed of processing that should be decisive.
That's highly dependent
In our previous episode, Michael Schnell said:
It's no different than UTF-16 if you want to do it properly. In both you
have to look out for surrogates.
Is UTF-16 Widestring in FPC (and Delphi 200x ? ) not done just ignoring
the surrogates ?
No different from UTF-8 in principle. Base
Graeme Geldenhuys wrote:
On Thu, Sep 25, 2008 at 10:33 PM, Florian Klaempfl
[EMAIL PROTECTED] wrote:
Who says that? UTF-16 is simply chosen because it has features (supporting
all characters basically) ANSI doesn't?
Sorry, my message was unclear and I got somewhat mixed up
In our previous episode, Daniël Mantione said:
That's highly dependent on what your application does! If your
application primarily parses text files, it's relevant. :-)
Shortstrings and ansistrings won't go away. You'll still be able to code
fast text file parsers. Note that in such cases
On Fri, 26 Sep 2008, Graeme Geldenhuys wrote:
On Fri, Sep 26, 2008 at 10:43 AM, Michael Schnell [EMAIL PROTECTED] wrote:
It's no different than UTF-16 if you want to do it properly. In both you
have to look out for surrogates.
Is UTF-16 Widestring in FPC (and Delphi 200x ? ) not done
Ivo Steinmann wrote:
At the core of all Windows NT systems there is the NT API. The normal
WinAPI sits on top of the NT API. The NT API itself uses UTF-16 as its
string type!
type
UNICODE_STRING = record
Length: USHORT;
MaximumLength: USHORT;
Buffer: PWSTR;
end;
const
On Fri, Sep 26, 2008 at 11:11 AM, Ivo Steinmann [EMAIL PROTECTED] wrote:
So in core, winnt is working with UTF16. All ANSI Winapi functions map
to these winnt calls.
So then there is already a conversion going on. From ANSI api to
UTF16 api. I still think (and will try and put together some
On Fri, Sep 26, 2008 at 11:17 AM, Daniël Mantione
[EMAIL PROTECTED] wrote:
Russian, Arabic, Japanese are languages in daily use on computers, countless
electronic documents in these languages exist.
And most documents that exist in the world are in UTF-8 format: Save
to file, HTML documents
On 26 Sep 2008, at 10:43, Michael Schnell wrote:
Is UTF-16 Widestring in FPC (and Delphi 200x ? ) not done just
ignoring the surrogates ?
At least the Unix widestring manager fully supports surrogates (except
if you use the MSIDE-patched version, where it has been removed
because it is
On Fri, 26 Sep 2008, Marco van de Voort wrote:
In our previous episode, Daniël Mantione said:
That's highly dependent on what your application does! If your
application primarily parses text files, it's relevant. :-)
Shortstrings and ansistrings won't go away. You'll still be able to code
On Fri, Sep 26, 2008 at 11:31 AM, Marco van de Voort [EMAIL PROTECTED] wrote:
Someone writing a spell checker for old-Egyptian Hieroglyphs will have to
deal with surrogates. For those people UTF-16 has few advantages over
UTF-8, (although in practice it's still a bit easier to handle than
On Fri, 26 Sep 2008, Marco van de Voort wrote:
In our previous episode, Daniël Mantione said:
as I know D2009 (I think) handles this correctly, but I have no idea
how.
Let me put it like this: Someone writing a Russian/Arabic/Japanese spell
checker does not have to handle surrogates with
In our previous episode, Daniël Mantione said:
Accepting both Arabic and Westernized Arabic numerals would possibly break a
lot of code anyway, since to string and back wouldn't be reversible.
It has never been reversible. Think about val('$100',v);
See one line further down.
On Friday 26 September 2008 09.34:44 Graeme Geldenhuys wrote:
Well if you have Utf-8 versions of all basic string processing
functions like Pos, Length, Copy, Insert etc you don't have to think
of encoding or anything. fpGUI uses UTF-8 internally, and I never have
to think about what encoding
In our previous episode, Martin Schreiber said:
Well if you have Utf-8 versions of all basic string processing
functions like Pos, Length, Copy, Insert etc you don't have to think
of encoding or anything. fpGUI uses UTF-8 internally, and I never have
to think about what encoding I'm
On Fri, Sep 26, 2008 at 11:46 AM, Martin Schreiber [EMAIL PROTECTED] wrote:
It seems you prefer utf-8 over utf-16 for internal string encoding in a GUI
framework. Why?
I prefer utf-16 over utf-8 for MSEide+MSEgui because *all* current users
(including the Chinese) can use simple string index
Marco van de Voort wrote:
For many people Unicode is just let's go UTF-8. It's far more than that
and 100% supporting Unicode is even next to impossible.
Correct, but that is what I'm suggesting. UTF-16 is not a cure all either,
only at a first superficial glance. I'm btw not
In our previous episode, Martin Schreiber said:
Hmm, you should ask the Russian users for example if they prefer MSEgui
utf-16
internal encoding or Lazarus utf-8.
Users always look short term, and want to change as little as possible.
This goes both for UTF-16 (with the is UCS2
Hello Graeme,
Friday, September 26, 2008, 10:50:43 AM, you wrote:
GG Good question and I have been wondering about this myself. In D2009
GG SizeOf(Char) = 2, so I have no idea how that works with surrogate
GG pairs. Can anybody explain this please?
I don't know how D2009 and others do it, but
In our previous episode, Ivo Steinmann said:
in the native encoding per platform.
I guess that would be one of the best solutions. Having a system unicode
string type and then some specialized string types.
SysString
UTF8String
UTF16String
UTF32String
Anyway, I still think
Martin Schreiber wrote:
Hmm, you should ask the Russian users for example if they prefer MSEgui utf-16
internal encoding or Lazarus utf-8.
You are mixing things up a bit. People from the Russian forum prefer fewer
bugs. And the UTF-8 implementation of Lazarus brought them a lot. This is the
difference.
On Fri, Sep 26, 2008 at 12:34 PM, Marco van de Voort [EMAIL PROTECTED] wrote:
I guess that would be one of the best solutions. Having a system unicode
string type and then some specialized string types.
SysString
UTF8String
UTF16String
UTF32String
Anyway, I still think something like this
Is UTF-16 Widestring in FPC (and Delphi 200x ? ) not done just ignoring the
surrogates ?
Let's hope not,
I don't think full UTF-16 would really be desirable over UCS-2.
Imagine you have a string of some million characters (e.g. a Book). All
functions that need to find the
need to be 32 bit if surrogates were allowed in Widestrings).
How to squeeze a value $ in a 16 Bit value ?
Can you magically store two bits in a single hardware cell ?
-Michael
On Fri, 26 Sep 2008, Graeme Geldenhuys wrote:
Taking a step back from Free Pascal and Tiburon How do other
frameworks handle string encodings etc... Frameworks like Java, Qt
etc... Can't we learn something from them as well? Both Java and Qt
run on multiple platforms, read/write to
On Friday 26 September 2008 12.30:27 Marco van de Voort wrote:
In our previous episode, Martin Schreiber said:
Hmm, you should ask the Russian users for example if they prefer MSEgui
utf-16 internal encoding or Lazarus utf-8.
Users always look short term, and want to change as little as
How do other
frameworks handle string encodings etc
With .NET/Mono I suppose you don't need to bother. But I suppose this is
one of the reasons that strings are constants once they are assigned
some value; and you can't do things like s[n] := 'x'.
-Michael
In our previous episode, Michael Schnell said:
need to be 32 bit if surrogates were allowed in Widestrings).
How to squeeze a value $ in a 16 Bit value ?
Can you magically store two bits in a single hardware cell ?
As said before, unicode is more than just expanding the range of
In our previous episode, Daniël Mantione said:
Taking a step back from Free Pascal and Tiburon How do other
frameworks handle string encodings etc... Frameworks like Java, Qt
etc... Can't we learn something from them as well? Both Java and Qt
run on multiple platforms, read/write to
Nonetheless a type to hold a single character needs to exist. And same
needs to be a 32 bit type if you want to store more than 2^16 different
values (as possible with UTF-8 and UTF-16 but not with UCS-2).
-Michael
Graeme Geldenhuys wrote:
Has anybody else got sample test code that clearly shows the claimed
significant speed gain in using UTF-16 for Windows API's? If so,
could you please post the code and your comparative results (timing
values). I think most people's perception was that ANSI API's will
In our previous episode, Michael Schnell said:
Is UTF-16 Widestring in FPC (and Delphi 200x ? ) not done just ignoring the
surrogates ?
Let's hope not,
I don't think full UTF-16 would really be desirable over UCS-2.
Imagine you have a string of some million characters (e.g.
On Fri, 26 Sep 2008 13:20:57 +0200
Michael Schnell [EMAIL PROTECTED] wrote:
Nonetheless a type to hold a single character needs to exist. And
same needs to be a 32 bit type if you want to store more than 2^16
different values (as possible with UTF-8 and UTF-16 but not with
UCS-2).
Some
Hi,
Yes I know we have had lengthy discussions about this before.
Everybody (whoever they might be) keeps saying that UTF-16 was chosen
for Tiburon's UnicodeString because it makes significant speed gains
when calling the Windows API based on UTF-16 - compared to the ANSI
API's. The whole debate
Graeme Geldenhuys wrote:
Hi,
Yes I know we have had lengthy discussions about this before.
Everybody (whoever they might be) keeps saying that UTF-16 was chosen
for Tiburon's UnicodeString because it makes significant speed gains
when calling the Windows API based on UTF-16 - compared to the
Hello Graeme,
Thursday, September 25, 2008, 9:50:04 PM, you wrote:
GG Yes I know we have had lengthy discussions about this before.
GG Everybody (whoever they might be) keeps saying that UTF-16 was chosen
GG for Tiburon's UnicodeString because it makes significant speed gains
GG when calling the