In our previous episode, Michael Schnell said:
> > Either you have UTF-8 with surrogates, or you have ASCII (since UTF-8
> > without surrogates means that only char 0..127 are valid, which is ASCII)
> In another post surrogate pairs have been denoted as a specialty of a 16
> Bit coding (UCS-2), an
On 11/18/2010 02:31 PM, Marco van de Voort wrote:
Either you have UTF-8 with surrogates, or you have ASCII (since UTF-8
without surrogates means that only char 0..127 are valid, which is ASCII)
In another post surrogate pairs have been denoted as a specialty of a 16
Bit coding (UCS-2), and I di
In our previous episode, Michael Schnell said:
> > found by a dumb byte/char scan; only few encodings have to be
> > recognized and handled, based on the char size: MBCS (UTF-8...),
> > WideChars (UTF-16/UCS2) and UTF-32.
> >
> In fact I suppose that for UTF-8 ("pure UTF-8" without surrogates) po
On 11/18/2010 12:33 AM, Hans-Peter Diettrich wrote:
Separator characters can be assumed as ASCII, so that they can be
found by a dumb byte/char scan; only few encodings have to be
recognized and handled, based on the char size: MBCS (UTF-8...),
WideChars (UTF-16/UCS2) and UTF-32.
In fact I su
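
A minimal sketch of the "dumb byte scan" idea above, in Python rather than Pascal. It relies on one property of UTF-8: every byte of a multi-byte sequence has its high bit set, so a byte below 0x80 can only be an ASCII character. The function name split_on_ascii is an illustrative assumption, not an RTL routine.

```python
def split_on_ascii(data: bytes, separator: bytes) -> list[bytes]:
    """Split UTF-8 data on a single ASCII separator byte.

    No decoding is needed: a byte < 0x80 never occurs inside a
    multi-byte UTF-8 sequence, so a plain byte comparison is safe.
    """
    if len(separator) != 1 or separator[0] >= 0x80:
        raise ValueError("separator must be a single ASCII byte")
    fields = []
    start = 0
    for i, byte in enumerate(data):
        if byte == separator[0]:
            fields.append(data[start:i])
            start = i + 1
    fields.append(data[start:])  # trailing field after the last separator
    return fields
```

The same scan would not be safe on UTF-16 or UTF-32 data viewed as bytes, since there an ASCII-valued byte can be half of a wider code unit.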
Marco van de Voort wrote:
It's a user's own choice to not be unicode compliant in his apps (e.g. if he
knows he never goes to the Eastern Asiatic market etc), but a runtime should
be as unicode compliant as reasonably possible.
IMO there exist levels of compliance.
The bottom level supplies
On 17 Nov 2010, at 13:44, Michael Schnell wrote:
In fact I was not aware of the UTF-16 coding scheme. I _supposed_ it
would work similarly to UTF-8 (highest bit set => 32 bit value
composed from the 31 remaining bits of this and the next word and
bit 31 reset) and thus could be decoded algor
On 11/17/2010 01:32 PM, Marco van de Voort wrote:
Regarding OS X, iirc I saw a mention somewhere that some components of Mac
OS X prefer decomposed characters. (aka UTF-8Mac).
In another forum I saw this mentioned as surrogate pairs. Sorry for the
confusion :(.
-Michael
___
On 11/17/2010 01:20 PM, Jonas Maebe wrote:
Surrogate pairs have nothing to do with Mac OS X. Surrogate pairs are
required when encoding any codepoint in UTF-16 whose UTF32 value is >=
$1.
In fact I was not aware of the UTF-16 coding scheme. I _supposed_ it
would work similarly to UTF-8 (hi
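
For reference, a small Python sketch of how UTF-16 represents a code point. In UTF-16, code points from U+10000 upward are split into a high surrogate (D800..DBFF) and a low surrogate (DC00..DFFF) by plain arithmetic, with no lookup tables. The function names are illustrative assumptions.

```python
def to_utf16_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units (one or two) for a code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]          # fits in a single 16-bit unit
    offset = code_point - 0x10000    # 20 bits remain
    high = 0xD800 + (offset >> 10)   # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # lower 10 bits
    return [high, low]


def from_surrogates(high: int, low: int) -> int:
    """Combine a high/low surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

Decomposed characters (a base letter followed by combining marks) are a separate matter: they are sequences of several code points and exist in every Unicode encoding.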
In our previous episode, Michael Schnell said:
> > It is not viable not to. Either you implement unicode or not.
> >
> > It's a user's own choice to not be unicode compliant in his apps (e.g. if he
> > knows he never goes to the Eastern Asiatic market etc), but a runtime should
> > be as unicode co
On 17 Nov 2010, at 12:23, Michael Schnell wrote:
Regarding that handling surrogate pairs needs tables while UTF/UCS
handling can be done by simple algorithms and that (AFAIK) surrogate
pairs are used only in certain environments (Mac and what else ?)
Surrogate pairs have nothing to do with
On 11/17/2010 12:02 PM, Marco van de Voort wrote:
In our previous episode, Michael Schnell said:
Only the ones where surrogates really matter.
Is it really viable to have the compiler/RTL try to automatically handle
these ugly beasts,
It is not viable not to. Either you implement unicode or no
In our previous episode, Michael Schnell said:
> >
> > Only the ones where surrogates really matter.
> Is it really viable to have the compiler/RTL try to automatically handle
> these ugly beasts,
It is not viable not to. Either you implement unicode or not.
It's a user's own choice to not be
On 11/17/2010 10:12 AM, Marco van de Voort wrote:
Only the ones where surrogates really matter.
Is it really viable to have the compiler/RTL try to automatically handle
these ugly beasts, rather than presenting them to the poor user as two
separate Unicode characters (and only handle the UTC/U
On 11/15/2010 01:24 PM, Marco van de Voort wrote:
Typically I'd iterate by means outside the language (I've used simple iterators
based on a record with a few inline methods in the past), and review the
places where you iterate by char through strings, and reduce it
significantly.
Since the latt
In our previous episode, Hans-Peter Diettrich said:
> >
> > I don't consider it an extreme, on the contrary. Trying to fix this is
> > extreme IMHO.
>
> Sorry, I understood that you want to replace all for loops by iterated
> loops.
Only the ones where surrogates really matter.
> >>> And in
Marco van de Voort wrote:
In our previous episode, Hans-Peter Diettrich said:
Yes, but the realisation should be that holding on to array indexing is
what makes it expensive. The problem could be strongly reduced by removing
such array indexing skeleton from all routines where it is not neces
In our previous episode, Hans-Peter Diettrich said:
> > Yes, but the realisation should be that holding on to array indexing is
> > what makes it expensive. The problem could be strongly reduced by removing
> > such array indexing skeleton from all routines where it is not necessary.
>
> Why fall
Marco van de Voort wrote:
Yes, but the realisation should be that holding on to array indexing is
what makes it expensive. The problem could be strongly reduced by removing
such array indexing skeleton from all routines where it is not necessary.
Why fall from one extreme into the other one
On Tue, 16 Nov 2010, Marco van de Voort wrote:
Furthermore I think that in detail Unicode string handling should not be
based on single characters at all, but instead should use (sub)strings
all over, covering multibyte character representations, ligatures etc.
as well
This is dog slow. You
In our previous episode, Hans-Peter Diettrich said:
> > First you would have to come up with a workable model for s[x] being
> > utf32chars in general that doesn't suffer from O(N^2) performance
> > degradation (read/write)
>
> Right, UTF-32 or UCS2 would be much more useful in computations.
I said s
Alexander Klenin wrote:
The total order will be something between O(n^1) and O(n^2), depending on
many factors (what is "n"?...).
Huh? O(f(n)) has a precise definition, and of course we are talking worst-case
complexity here (although average complexity would be the same in this case).
n is
Marco van de Voort wrote:
First you would have to come up with a workable model for s[x] being
utf32chars in general that doesn't suffer from O(N^2) performance
degradation (read/write)
Right, UTF-32 or UCS2 would be much more useful in computations.
And for it to be useful, it must be workabl
On Tue, Nov 16, 2010 at 01:50, Hans-Peter Diettrich
wrote:
>> The order of the algorithm is then still O(n^2), since UTF8Char will
>> already be O(n)?
>
> The total order will be something between O(n^1) and O(n^2), depending on
> many factors (what is "n"?...).
Huh? O(f(n)) has a precise definit
On Mon, Nov 15, 2010 at 11:21 AM, Michael Schnell wrote:
> ..forces the programmer to work with both UTF-8 and UCS32 coded Unicode
> characters. This might blow his mind even more (regarding that e.g. the
> Lazarus LCL forces him to work with UTF-8 coded Unicode in a string type
> called "ANSIStri
In our previous episode, Hans-Peter Diettrich said:
> >> At least the example code has to be made to work, i.e. the nonsense statement
> >>DoSomething(ch(i));
> >> has to be changed into something like
> >>DoSomething(GetUTF8char(s,i));
> >> before we can talk honestly about the order of t
Marco van de Voort wrote:
At least the example code has to be made to work, i.e. the nonsense statement
DoSomething(ch(i));
has to be changed into something like
DoSomething(GetUTF8char(s,i));
before we can talk honestly about the order of the loop.
The order of the algorithm is then
On 15-11-2010 10:22, Vincent Snijders wrote:
Maybe I did not understand Thaddy, but to give you O(1) access to the
ith character, I was thinking about a translation table of the utf8
string, with key=index (1..length) and value=offset in bytes to the
ith character. Such a translation table wou
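
A rough Python sketch of such a translation table, assuming valid UTF-8 input. Building the table takes one O(n) pass; afterwards each lookup is O(1). Unlike the 1..length keys described above, this sketch uses 0-based indices, and both function names are hypothetical.

```python
def build_offset_table(data: bytes) -> list[int]:
    """Byte offset of the lead byte of every code point in valid UTF-8.

    Continuation bytes have the bit pattern 10xxxxxx, so any byte that
    does not match it starts a new code point.
    """
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


def char_at(data: bytes, table: list[int], index: int) -> str:
    """O(1) access to code point number `index` (0-based) via the table."""
    start = table[index]
    end = table[index + 1] if index + 1 < len(table) else len(data)
    return data[start:end].decode("utf-8")
```

The cost is extra memory for one integer per character, and the table has to be rebuilt (or patched) whenever the string is modified.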
In our previous episode, Michael Schnell said:
> > No, since that wouldn't describe the position of that char in the string
> > that is being iterated.
> Is this really wanted ?
>
> I suppose this would ask for a full blown iterator
Typically I'd iterate by means outside the language (I've us
On 11/15/2010 11:40 AM, Marco van de Voort wrote:
No, since that wouldn't describe the position of that char in the string
that is being iterated.
Is this really wanted ?
I suppose this would ask for a full blown iterator
-Michael
___
fpc-devel
In our previous episode, Michael Schnell said:
> > The comparison in the UTF-8 string example is very questionable. First
> > ch(i) is not equivalent to ch, not even closely related, and the claim
> > of O(N^2) operations deserves a proof - IMO it's simply wrong.
>
> With UTF-8 strings and frie
On 11/15/2010 11:20 AM, Vincent Snijders wrote:
I agree, and that is why you need enumerators to make it work.
OK, in fact this _is_ an implementation of an enumerator, but it is
hidden and so the application programmer is not forced to bother. He
just sees the Unicode character in the loop
unintentionally deleted;
..forces the programmer to work with both UTF-8 and UCS32 coded Unicode
characters. This might blow his mind even more (regarding that e.g. the
Lazarus LCL forces him to work with UTF-8 coded Unicode in a string type
called "ANSIString" :( )
-Michael
2010/11/15 Michael Schnell :
> On 11/15/2010 10:22 AM, Vincent Snijders wrote:
>>
>> I cannot imagine another way that a translation table can give you O(1)
>> access.
>>
> Maybe I don't understand the O(1) correctly. Do you think it should be
> necessary to access each character in the string wit
On 11/15/2010 10:10 AM, Alexander Klenin wrote:
Actually, I do not think so. I believe that an integer containing the codepoint
is preferable implementation.
OK, Unicode always blows up the complexity of the code greatly ;).
Your suggestion would result in an UTF-8 -> UCS32 translation and thu
On 11/15/2010 10:22 AM, Vincent Snijders wrote:
I cannot imagine another way that a translation table can give you O(1) access.
Maybe I don't understand the O(1) correctly. Do you think it should be
necessary to access each character in the string within each iteration
in this way.
What I
2010/11/15 Michael Schnell :
> On 11/14/2010 03:33 PM, Vincent Snijders wrote:
>>
>> I did not have in mind such a sophisticated UTF8 string
>> implementation, that included a translation table for easy indexing.
>
> I don't think you need a translation table to walk through an UTF-8 String
Maybe
On Mon, Nov 15, 2010 at 18:38, Michael Schnell wrote:
> On 11/13/2010 08:56 PM, Hans-Peter Diettrich wrote:
>>
>>
>> The comparison in the UTF-8 string example is very questionable. First
>> ch(i) is not equivalent to ch, not even closely related, and the claim of
>> O(N^2) operations deserves an
On 11/14/2010 10:12 PM, Hans-Peter Diettrich wrote:
With regards to UTF-8 (or other MBCS) strings, what does Length(s)
return in these cases? IMO other functions have to be used for the
determination of the true character count (as opposed to the char=byte
count).
Of course it's possible without
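
To illustrate the difference between the byte count and the character count, a short Python sketch, assuming valid UTF-8 and taking "character" to mean code point. The names are illustrative and do not correspond to any FPC function.

```python
def byte_length(data: bytes) -> int:
    """Number of bytes, i.e. what a char=byte Length() would report."""
    return len(data)


def code_point_count(data: bytes) -> int:
    """Number of code points: count every byte that is not a
    continuation byte (10xxxxxx). This needs a full O(n) scan."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)
```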
On 11/14/2010 03:33 PM, Vincent Snijders wrote:
I did not have in mind such a sophisticated UTF8 string
implementation, that included a translation table for easy indexing.
I don't think you need a translation table to walk through an UTF-8
String Unicode-Character by Unicode-Character (and cre
On 11/14/2010 09:47 PM, Hans-Peter Diettrich wrote:
I wonder how FPC defines low() and high() for sets.
IMHO it should not. An "in" loop on sets should not use a defined
sequence. Relying on an "order" of the elements of a set
mathematically is erroneous.
-Michael
__
On 11/13/2010 08:56 PM, Hans-Peter Diettrich wrote:
The comparison in the UTF-8 string example is very questionable. First
ch(i) is not equivalent to ch, not even closely related, and the claim
of O(N^2) operations deserves a proof - IMO it's simply wrong.
With UTF-8 strings and friends wo
On Mon, Nov 15, 2010 at 08:25, Marco van de Voort wrote:
> In our previous episode, Hans-Peter Diettrich said:
>> At least the example code has to be made to work, i.e. the nonsense statement
>> DoSomething(ch(i));
>> has to be changed into something like
>> DoSomething(GetUTF8char(s,i));
>> be
On Sun, Nov 14, 2010 at 08:52, Graeme Geldenhuys
wrote:
> If you use full-blown Iterator classes (instead of just for-in style)
> you get a lot more too:
>
> * full control over iteration
> - move forward
> - move back
> - reset iteration
> - peek forward/back
> - skip, etc...
> * you
In our previous episode, Hans-Peter Diettrich said:
> > the O(N^2) stems from the fact that it is hard to get the ith
> > character in a UTF8String in O(1). Suppose it is O(N), then the loop
> > is O(n^2).
>
> With regards to UTF-8 (or other MBCS) strings, what does Length(s)
The base size of
In our previous episode, Hans-Peter Diettrich said:
>
> > A more grave reason though is that Delphi does not have low() and high() on
> > sets and a request to add it by me in 2006 was closed with their equivalent
> > of "won't fix".
>
> I wonder how FPC defines low() and high() for sets.
See th
Vincent Snijders wrote:
2010/11/14 Thaddy :
On 13-11-2010 20:56, Hans-Peter Diettrich wrote:
The comparison in the UTF-8 string example is very questionable. First
ch(i) is not equivalent to ch, not even closely related, and the claim of
O(N^2) operations deserves a proof - IMO it's simply w
Marco van de Voort wrote:
A more grave reason though is that Delphi does not have low() and high() on
sets and a request to add it by me in 2006 was closed with their equivalent
of "won't fix".
I wonder how FPC defines low() and high() for sets. The static bounds
can be obtained from the un
Thaddy wrote:
The comparison in the UTF-8 string example is very questionable. First
ch(i) is not equivalent to ch, not even closely related, and the claim
of O(N^2) operations deserves a proof - IMO it's simply wrong.
Yes, this caught my eye as well: O(N^2) seems only the case if "length"
In our previous episode, Thaddy said:
> > would be evaluated every time. S
> > the O(N^2) stems from the fact that it is hard to get the ith
> > character in a UTF8String in O(1). Suppose it is O(N), then the loop
> > is O(n^2).
> >
> "Hard to" is implementation detail and not part of any algorit
2010/11/14 Thaddy :
> On 14-11-2010 13:22, Vincent Snijders wrote:
>>
>> would be evaluated every time. S
>> the O(N^2) stems from the fact that it is hard to get the ith
>> character in a UTF8String in O(1). Suppose it is O(N), then the loop
>> is O(n^2).
>>
>> Vincent
>
> "Hard to" is implement
On 14-11-2010 13:22, Vincent Snijders wrote:
would be evaluated every time. S
the O(N^2) stems from the fact that it is hard to get the ith
character in a UTF8String in O(1). Suppose it is O(N), then the loop
is O(n^2).
Vincent
"Hard to" is implementation detail and not part of any algorithm.
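
The two positions can be made concrete with a Python sketch, assuming valid UTF-8. If the ith character is fetched by scanning from the start, each fetch is O(n), and a loop over all i is O(n^2). A single forward pass that remembers its byte position visits the same characters in O(n). Both function names are hypothetical; nth_code_point only plays the role of the GetUTF8char(s,i) call discussed in the thread, with a 0-based index.

```python
from typing import Iterator


def nth_code_point(data: bytes, index: int) -> str:
    """Find code point number `index` (0-based) by scanning from the
    start. Each call is O(n), so calling it for every index is O(n^2)."""
    if index < 0:
        raise IndexError(index)
    seen = -1
    start = None
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:          # lead byte: a new code point begins
            seen += 1
            if seen == index:
                start = i
            elif seen == index + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")  # it was the last code point


def iterate_code_points(data: bytes) -> Iterator[str]:
    """Yield every code point in one forward pass: O(n) in total."""
    start = 0
    for i in range(1, len(data) + 1):
        if i == len(data) or (data[i] & 0xC0) != 0x80:
            yield data[start:i].decode("utf-8")
            start = i
```

Both produce the same characters; only the total work differs, which is the argument for an enumerator (or a translation table) over repeated indexed access.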
2010/11/14 Thaddy :
> On 13-11-2010 20:56, Hans-Peter Diettrich wrote:
>>
>> The comparison in the UTF-8 string example is very questionable. First
>> ch(i) is not equivalent to ch, not even closely related, and the claim of
>> O(N^2) operations deserves a proof - IMO it's simply wrong.
>>
> Yes,
In our previous episode, Thaddy said:
> > The comparison in the UTF-8 string example is very questionable. First
> > ch(i) is not equivalent to ch, not even closely related, and the claim
> > of O(N^2) operations deserves a proof - IMO it's simply wrong.
> >
> Yes, this caught my eye as well: O(
On 13-11-2010 20:56, Hans-Peter Diettrich wrote:
The comparison in the UTF-8 string example is very questionable. First
ch(i) is not equivalent to ch, not even closely related, and the claim
of O(N^2) operations deserves a proof - IMO it's simply wrong.
Yes, this caught my eye as well: O(N^
On 13 November 2010 23:32, Sven Barth wrote:
> On 13.11.2010 20:56, Hans-Peter Diettrich wrote:
>>
>> In general, what's the benefit of using enumerators? IMO a for loop
>> executes faster on (linear) string and array types, where enumerator
>> calls occur in for-in (see also my note on the UTF-8
On 13.11.2010 20:56, Hans-Peter Diettrich wrote:
In general, what's the benefit of using enumerators? IMO a for loop
executes faster on (linear) string and array types, where enumerator
calls occur in for-in (see also my note on the UTF-8 string example).
I'd say they simplify the code. They mi
Marco van de Voort wrote:
we have placed a new major release of the Free Pascal
Compiler, version 2.4.2 on our ftp-servers.
Great :-)
Some highlights are:
Compiler:
* Support D2006+ FOR..IN, with some FPC specific enhancements. Refer to
http://wiki.freepascal.org/for-in_loop for m