Re: Why Work at Encoding Level?

2015-10-13 Thread Philippe Verdy
Speed is not much linked to the in-memory buffer sizes (memory is cheap now and comfortable) and parsing in-memory encodings is extremely fast. The actual limitation is in I/O (network or storage on disk), and at this level you work with network datagrams/packets, or disk buffers or memory pages fo

Re: Why Work at Encoding Level?

2015-10-13 Thread Daniel Bünzli
On Tuesday, 13 October 2015 at 23:37, Richard Wordingham wrote: > If you are referring to indexing, I suspect the issue is performance. > UTF-32 feels wasteful, and if the underlying character text is UTF-8 or > UTF-16 we need an auxiliary array to convert character number to byte > offset if we ar
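
A minimal Python sketch of the auxiliary array mentioned above, assuming well-formed UTF-8 input. The names build_offset_index and char_at are hypothetical and do not come from the thread.

```python
def build_offset_index(data: bytes) -> list[int]:
    """Return the byte offset at which each code point starts.

    Assumes `data` is well-formed UTF-8 (an assumption of this sketch).
    """
    # In UTF-8 every byte that is not a continuation byte (0b10xxxxxx)
    # begins a new code point.
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


def char_at(data: bytes, index: list[int], n: int) -> str:
    """Return the n-th code point using the precomputed offset index."""
    start = index[n]
    end = index[n + 1] if n + 1 < len(index) else len(data)
    return data[start:end].decode("utf-8")


text = "a\u00e9\u20ac\U0001F600z"   # 1-, 2-, 3- and 4-byte code points
data = text.encode("utf-8")
index = build_offset_index(data)
```

The index costs one integer per code point, which is the memory overhead the message is weighing against the fixed-width layout of UTF-32.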

Why Work at Encoding Level?

2015-10-13 Thread Richard Wordingham
On Tue, 13 Oct 2015 16:09:16 +0100 Daniel Bünzli wrote (under topic heading 'Counting Codepoints') > I don't understand why people still insist on programming with > Unicode at the encoding level rather than at the scalar value level. > Deal with encoding errors and sanitize your inputs at the IO
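
One way to read the quoted advice, as a hedged Python sketch: decode once at the I/O boundary and replace malformed input there, so the rest of the program only sees scalar values. The function name read_scalar_values is hypothetical, and using U+FFFD replacement is this sketch's choice, not something the thread prescribes.

```python
def read_scalar_values(raw: bytes) -> str:
    """Decode UTF-8 at the I/O boundary.

    Malformed sequences are replaced by U+FFFD, so the returned string
    contains only Unicode scalar values (no surrogate code points).
    """
    return raw.decode("utf-8", errors="replace")


clean = read_scalar_values(b"ab\xffcd")   # 0xFF can never appear in UTF-8
```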

Re: Counting Codepoints

2015-10-13 Thread Richard Wordingham
On Tue, 13 Oct 2015 12:17:43 +0200 Philippe Verdy wrote: > 2015-10-13 8:36 GMT+02:00 Richard Wordingham < > richard.wording...@ntlworld.com>: > > For > > example, a MSKLC keyboard will deliver a supplementary character in > > two WM_CHAR messages, one for the high surrogate and one for the low

Re: Counting Codepoints

2015-10-13 Thread Richard Wordingham
On Tue, 13 Oct 2015 15:23:36 + David Starner wrote: > A UTF-16 string could delete one surrogate, or add a fractional > character. A Unicode string (not a "UTF-16 string"), which could be > stored internally in, say, a Python-like format which is Latin-1, > UCS-2, or UTF-32, conversions made

Re: Counting Codepoints

2015-10-13 Thread Richard Wordingham
On Tue, 13 Oct 2015 14:08:28 +0200 Mark Davis ☕️ wrote: > On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham < > richard.wording...@ntlworld.com> wrote: > > Rather the question must be the unwieldy one of how > > many scalar values and lone surrogates it contains in total. > That may be the q
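
The "unwieldy" count of scalar values plus lone surrogates can be sketched in Python over a list of 16-bit code units. The function name count_code_points and the list-of-integers representation are assumptions made for illustration.

```python
def count_code_points(units: list[int]) -> tuple[int, int]:
    """Return (scalar_values, lone_surrogates) for a UTF-16 code unit sequence."""
    scalars = lone = 0
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        next_is_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and next_is_low:
            scalars += 1          # well-formed pair: one supplementary scalar value
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            lone += 1             # unpaired high or low surrogate
            i += 1
        else:
            scalars += 1          # BMP scalar value
            i += 1
    return scalars, lone


# 'A', a surrogate pair, a lone high surrogate, 'B'
counts = count_code_points([0x41, 0xD83D, 0xDE00, 0xD800, 0x42])
```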

Re: Counting Codepoints

2015-10-13 Thread David Starner
On Mon, Oct 12, 2015 at 11:42 PM Richard Wordingham < richard.wording...@ntlworld.com> wrote: > On Mon, 12 Oct 2015 23:35:32 + > David Starner wrote: > > > Thus a Unicode string simply can't be in UTF-16 format > > internally with unpaired surrogates; a Unicode string in a programmer > > opaq

Re: Counting Codepoints

2015-10-13 Thread Daniel Bünzli
On Tuesday, 13 October 2015 at 15:46, Doug Ewell wrote: > I've been bemused by all this discussion about how unpaired surrogates > are supposed to behave I don't understand why people still insist on programming with Unicode at the encoding level rather than at the scalar value level. Deal with e

Re: Counting Codepoints

2015-10-13 Thread Doug Ewell
Richard Wordingham wrote: > You're assuming that the source of the non-conformance is external to > the program. In the case that has caused me to ask about lone > surrogates, they were actually caused by a faulty character deletion > function within the program itself. I've been bemused by all t
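
How a deletion function inside a program can create a lone surrogate is easy to reproduce. The Python sketch below contrasts a deletion that works on single UTF-16 code units with one that removes the whole surrogate pair; both function names are hypothetical and neither is the function discussed in the thread.

```python
def is_high(u: int) -> bool:
    return 0xD800 <= u <= 0xDBFF


def is_low(u: int) -> bool:
    return 0xDC00 <= u <= 0xDFFF


def delete_unit(units: list[int], i: int) -> list[int]:
    """Faulty: removes one code unit, possibly splitting a surrogate pair."""
    return units[:i] + units[i + 1:]


def delete_code_point(units: list[int], i: int) -> list[int]:
    """Removes the whole code point that the code unit at index i belongs to."""
    start, end = i, i + 1
    if is_low(units[i]) and i > 0 and is_high(units[i - 1]):
        start = i - 1             # i is the second half of a pair
    elif is_high(units[i]) and i + 1 < len(units) and is_low(units[i + 1]):
        end = i + 2               # i is the first half of a pair
    return units[:start] + units[end:]


units = [0x41, 0xD83D, 0xDE00, 0x42]      # 'A', U+1F600, 'B'
broken = delete_unit(units, 2)            # leaves a lone high surrogate
intact = delete_code_point(units, 2)      # removes the full pair
```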

Re: Counting Codepoints

2015-10-13 Thread Philippe Verdy
This works in Java because Java also treats surrogates as characters, even if it has additional APIs to test strings for their actual encoding length for Unicode. But outside strings, characters are just integers matching their code point value, and are not restricted to be valid Unicode characters

Re: Counting Codepoints

2015-10-13 Thread Mark Davis ☕️
On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > Rather the question must be the unwieldy one of how > many scalar values and lone surrogates it contains in total. > That may be the question in theory; in practice no programming language is going to

Re: Counting Codepoints

2015-10-13 Thread Philippe Verdy
2015-10-13 8:36 GMT+02:00 Richard Wordingham < richard.wording...@ntlworld.com>: > For > example, a MSKLC keyboard will deliver a supplementary character in > two WM_CHAR messages, one for the high surrogate and one for the low > surrogate. > I have not tested the actual behavior in 64-bit version
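
When a supplementary character arrives as two separate code units, as described for the two WM_CHAR messages, the receiver has to buffer the high surrogate until the low one arrives. The Python sketch below only models that buffering; the class SurrogateAssembler is hypothetical, it does not call any Windows API, and substituting U+FFFD for an unpaired surrogate is this sketch's assumption.

```python
class SurrogateAssembler:
    """Combine UTF-16 code units delivered one at a time into code points."""

    def __init__(self) -> None:
        self._pending_high: int | None = None

    def feed(self, unit: int) -> list[int]:
        """Accept one code unit and return the code points completed by it."""
        out: list[int] = []
        if self._pending_high is not None:
            high, self._pending_high = self._pending_high, None
            if 0xDC00 <= unit <= 0xDFFF:
                # Standard UTF-16 pair arithmetic.
                return [0x10000 + ((high - 0xD800) << 10) + (unit - 0xDC00)]
            out.append(0xFFFD)    # buffered high surrogate was never completed
        if 0xD800 <= unit <= 0xDBFF:
            self._pending_high = unit      # wait for the low surrogate
        elif 0xDC00 <= unit <= 0xDFFF:
            out.append(0xFFFD)             # low surrogate with no preceding high
        else:
            out.append(unit)
        return out


assembler = SurrogateAssembler()
first = assembler.feed(0xD83D)    # high surrogate: nothing emitted yet
second = assembler.feed(0xDE00)   # low surrogate: completes U+1F600
```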