Marcin 'Qrczak' Kowalczyk wrote:
> > What if something occurs in the file and does not form a valid, say,
> > UTF-16 sequence?
>
> It's clearly invalid in the specs, so there would be an error detected.
> But '\0' characters are valid UTF-8, so the only reason to disallow them
> could be laziness, a
On Mon, 7 July 2003 05:46, Wu Yongwei wrote:
> What if something occurs in the file and does not form a valid, say,
> UTF-16 sequence?
It's clearly invalid in the specs, so there would be an error detected. But
'\0' characters are valid UTF-8, so the only reason to disallow them could be
la
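As an illustration of the two claims in the message above, here is a minimal Python sketch. It uses Python's built-in codecs as a stand-in for whatever decoder the language under discussion would have; the helper name `is_valid` is hypothetical. It shows a lone UTF-16 high surrogate being rejected, while a NUL character decodes as ordinary UTF-8.

```python
def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly in the given encoding (hypothetical helper)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# A high surrogate (D800) followed by 'A' (0041) is not a valid UTF-16 sequence.
lone_surrogate = b"\xd8\x00\x00\x41"
# A proper surrogate pair (D83D DE00) is valid.
surrogate_pair = b"\xd8\x3d\xde\x00"
# NUL is an ordinary character in UTF-8.
with_nul = b"a\x00b"
```

A strict decoder reports the first case as an error and accepts the other two without special handling.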
Jungshik Shin wrote:
> > It's unnecessary to handle ALL cases. You could address only issues
> > encountered/expected by your end users. IMHO, it is more important
> > to make an application be light-weight and run in 99% cases. Or, you
> > may find your language used by, say, 1 people, and
On Mon, 7 Jul 2003, Wu Yongwei wrote:
> > > I wonder, how many people really want to use Unicode codepoints beyond
> > > U+?
> >
> > I don't want to make it incorrect by design just because cases it doesn't
> > handle are rare.
>
> It's unnecessary to handle ALL cases. You could address o
> > I wonder, how many people really want to use Unicode codepoints beyond
> > U+?
>
> I don't want to make it incorrect by design just because cases it doesn't
> handle are rare.
It's unnecessary to handle ALL cases. You could address only issues
encountered/expected by your end users. IMHO
Hi, Bruno, it was a FYI. Is providing some extra information (since
Marcin mentioned he wanted his language to be portable to Windows)
wrong? I was not arguing which format string was more reasonable, or
standards-conformant. Just some information in case one wants one's
application portable to
On Sat, 5 July 2003 02:56, srintuar26 wrote:
> Old-fashioned character indexing is a dead-end road. Don't even bother
> with it. Why use Unicode at all if you plan to write code that's
> guaranteed to break for other languages?
>
> Bite the bullet and deal with text as sequences, you'll have to
On Fri, 4 Jul 2003, Marcin 'Qrczak' Kowalczyk wrote:
> > > ...With UTF-8 it's much worse...
> > Not actually by much, unless you're lucky enough to be able to decide that
> > a following accent never alters which class a character falls in. (What
> > if the user wants to write a "has no accents" p
On Fri, 4 July 2003 19:44, Henry Spencer wrote:
> > with UTF-32 you can write a predicate for "allowed as the first character
> > in identifier" and "allowed in the rest of identifier", and take as many
> > characters as satisfy the predicate. With UTF-8 it's much worse...
>
> Not actually by
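The predicate approach quoted above can be sketched in Python, whose strings are indexed by code point much like a UTF-32 array. The predicate definitions here are assumptions chosen for illustration (letters or underscore to start, letters, digits or underscore to continue), not the rules of the language being designed, and all three function names are hypothetical.

```python
def is_ident_start(ch: str) -> bool:
    """Hypothetical predicate: allowed as the first character of an identifier."""
    return ch == "_" or ch.isalpha()


def is_ident_rest(ch: str) -> bool:
    """Hypothetical predicate: allowed in the rest of an identifier."""
    return ch == "_" or ch.isalnum()


def scan_identifier(text: str, pos: int = 0) -> str:
    """Take as many code points as satisfy the predicates, starting at pos."""
    if pos >= len(text) or not is_ident_start(text[pos]):
        return ""
    end = pos + 1
    while end < len(text) and is_ident_rest(text[end]):
        end += 1
    return text[pos:end]
```

Each predicate sees exactly one code point, so the scanner never has to decide how many storage units to pass to it. As the reply notes, this still does not account for a following combining accent.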
On Thu, 3 Jul 2003, Marcin 'Qrczak' Kowalczyk wrote:
> - Character predicates must work on strings and it's not obvious what part of
> the string to feed to them. For example in a compiler/interpreter with
> UTF-32 you can write a predicate for "allowed as the first character in
> identifier"
Wu Yongwei wrote:
> FYI, what MSDN says about the types in printf/wprintf:
This is not only irrelevant, it is also wrong. "%S" is less portable than
"%ls". The standard directive for wchar_t* strings is "%ls". And your cited
description of "%s" has two errors.
Read the Linux wprintf(3) manual pag
On Thu, 3 July 2003 20:48, Bruno Haible wrote:
> You find details here:
[...]
Thanks! This is what I was looking for.
> Since strings are immutable in your language, you can also represent
> strings as UCS-2 or ISO-8859-1 if possible; this saves 75% of the memory
> in many cases, at the cos
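The storage idea quoted above (pick a narrower representation when an immutable string allows it) can be sketched as follows. This is only an illustration of the arithmetic, with hypothetical function names; it is not taken from any of the implementations mentioned in the thread.

```python
def narrowest_width(s: str) -> int:
    """Bytes per character needed for a fixed-width representation of s.

    1 = ISO-8859-1 suffices, 2 = UCS-2 suffices, 4 = full UTF-32 needed.
    """
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4


def storage_bytes(s: str) -> int:
    """Size of the character data under the narrowest fixed-width choice."""
    return len(s) * narrowest_width(s)
```

For a string that fits in ISO-8859-1, this uses one byte per character instead of four, which is where the 75% saving comes from. Because the string is immutable, the width never has to change after creation.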
On Fri, 4 July 2003 04:25, Wu Yongwei wrote:
> I wonder, how many people really want to use Unicode codepoints beyond
> U+?
I don't want to make it incorrect by design just because cases it doesn't
handle are rare.
> However, UTF-16 is harder to process than UTF-8 (I think you are wron
FYI, what MSDN says about the types in printf/wprintf:
%s String When used with printf functions, specifies a
single-byte–character string; when used with wprintf functions,
specifies a wide-character string.
%S String When used with printf functions, specifies a wide-character
string; whe
Marcin 'Qrczak' Kowalczyk wrote:
I'm pretty sure that all Windows versions since Win95 allow converting
between UTF-16 and the encoding used for filenames, and WinNT
allows using UTF-16 filenames directly. So for filenames and the
like, texts on screen, texts exchanged with databases etc. any
Unic
Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03:
> On Thu, 3 July 2003 20:13, Beni Cherniavsky wrote:
>
> > > What kind of string processing UTF-8 makes simpler than UTF-32?
> >
> > Any processing that works now on ASCII and doesn't break UTF-8
> > sequences in the middle.
>
> It's not simpler,
On Thu, 3 July 2003 20:13, Beni Cherniavsky wrote:
> > What kind of string processing UTF-8 makes simpler than UTF-32?
>
> Any processing that works now on ASCII and doesn't break UTF-8
> sequences in the middle.
It's not simpler, it's the same. Since the language is not C, it doesn't
matte
Hi Marcin,
> Most languages take 3, as I understand Perl it takes the mix of 3 and 2,
> and Python has both 3 and 1. I think I will take 1, but I need advice: -
Don't look at Perl in this case - Perl has the handicap that for historical
reasons it cannot make a clear distinction between byte arr
Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03:
> On Thu, 3 July 2003 19:02, srintuar26 wrote:
>
> > Well, for C++, whitespace is ' ', '\r', '\n', '\t'; it's totally trivial.
>
> Replace "whitespace" with "an arbitrary character predicate", e.g. for finding
> the end of an identifier.
>
> > I
On Thu, 3 July 2003 19:05, Beni Cherniavsky wrote:
> > I'm more afraid of requiring headaches from other people trying to
> > interface to C libraries.
>
> Please elaborate - interface from what to which C libraries? Do you
> mean here people writing C extensions to your language?
Yes. Of c
On Thu, 3 July 2003 19:02, srintuar26 wrote:
> Well, for C++, whitespace is ' ', '\r', '\n', '\t'; it's totally trivial.
Replace "whitespace" with "an arbitrary character predicate", e.g. for finding
the end of an identifier.
> If you want to iterate over a string, don't use single codepoi
On Thu, 3 July 2003 18:50, Beni Cherniavsky wrote:
> Such programs would break anyway, with non-spacing marks and wide
> characters.
But they would at least work with ISO-8859-x files. Not all programs are
expected to be used with all possible encodings, and using UTF-32 internally
would n
Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03:
> On Thu, 3 July 2003 18:17, Beni Cherniavsky wrote:
>
> > How many custom runtime libraries are you ready to write and use
> > (vs. compiling to calls to standard libraries)?
>
> I'm prepared to interface to iconv and such, and to different I/O
> I will consider this. Here are disadvantages:
>
> With UTF-8 it's much worse, you must in
> essence decode UTF-8 on the fly. You can't even implement "split on
> whitespace" without UTF-8 decoding, because you don't know what part of the
> string to test whether it's a whitespace chara
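The point quoted above, that splitting UTF-8 on whitespace needs on-the-fly decoding once the predicate goes beyond ASCII, can be sketched in Python working directly on bytes. The helper names are hypothetical, and `str.isspace` stands in for whatever whitespace predicate the language would provide.

```python
def utf8_seq_len(lead: int) -> int:
    """Length of the UTF-8 sequence that starts with the given lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte 0x{lead:02x}")


def split_on_whitespace(data: bytes) -> list:
    """Split UTF-8 bytes on Unicode whitespace, decoding one character at a time."""
    words = []
    start = None  # byte offset where the current word began, if any
    i = 0
    while i < len(data):
        n = utf8_seq_len(data[i])
        # The predicate needs a whole character, so the sequence must be decoded.
        ch = data[i:i + n].decode("utf-8")
        if ch.isspace():
            if start is not None:
                words.append(data[start:i])
                start = None
        elif start is None:
            start = i
        i += n
    if start is not None:
        words.append(data[start:])
    return words
```

The lead byte determines how many bytes to feed to the predicate, which is the extra step that the fixed-width case does not need. Malformed input raises an exception rather than being silently split.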
On Thu, 3 July 2003 18:17, Beni Cherniavsky wrote:
> How optimized should it be?
Like Lisp or Smalltalk, i.e. quite optimized for a dynamically typed
language, but not as much as statically typed languages usually are.
> How many custom runtime libraries are you ready to write and use
>
Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03:
> - Simple one-time-use programs which assume that characters are what you get
> when you index strings (which break paragraphs or draw ASCII tables or count
> occurrences of characters) are broken more often. They work only for ASCII,
> where w
srintuar26 wrote on 2003-07-03:
> Another suggestion for your language: keep the source code in the same
> encoding as your strings. If you want a UTF-32 language, then require
> all source code to be encoded in that same encoding as well. Allow
> identifiers, comments, and literals to all be in t
On Thu, 3 July 2003 17:52, srintuar26 wrote:
> It's thousands of times easier to use UTF-8,
I will consider this. Here are disadvantages:
- Character predicates must work on strings and it's not obvious what part of
the string to feed to them. For example in a compiler/interpreter with
U
Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03:
> I'm designing and implementing a programming language for fun. I
> must decide how strings should be handled.
>
> The language is most similar to Dylan, but let's assume its purpose
> will be like Python's. It will have a compiler which produces C c
> - Are there other reasonable ways?
> - Is it good to use UTF-32 on Unix and UTF-16 on Windows?
I'd advise against it. There is really no advantage to using UTF-32
anymore. It won't tell you where it's safe to break strings, you can still
have invalid sequences, some characters will still require
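The claim above, that code-point indexing alone does not say where a string can safely be broken, can be illustrated in Python, whose strings index by code point. This sketch only checks for combining marks, which is an assumption made to keep it short; full text segmentation involves more rules. The function name is hypothetical.

```python
import unicodedata


def safe_break_points(text: str) -> list:
    """Indices where a cut does not separate a combining mark from its base."""
    return [
        i for i in range(1, len(text))
        if unicodedata.combining(text[i]) == 0
    ]


# 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'a': three code points,
# but only two user-visible characters.
decomposed = "e\u0301a"
```

Cutting `decomposed` at index 1 would strip the accent from its base letter, even though every index is a valid code-point boundary.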
I'm designing and implementing a programming language for fun. I must decide
how strings should be handled.
The language is most similar to Dylan, but let's assume its purpose will be
like Python's. It will have a compiler which produces C code and should work
at least on Linux and Windows.
Wh