Re: Strings in a programming language

2003-07-07 Thread Wu Yongwei
Marcin 'Qrczak' Kowalczyk wrote: > > What if something occurs in the file and does not form a valid, say, > > UTF-16 sequence? > > It's clearly invalid in the specs, so there would be an error detected. > But '\0' characters are valid UTF-8, so the only reason to disallow them > could be laziness, a

Re: Strings in a programming language

2003-07-07 Thread Marcin 'Qrczak' Kowalczyk
On Mon, 7 July 2003 05:46, Wu Yongwei wrote: > What if something occurs in the file and does not form a valid, say, > UTF-16 sequence? It's clearly invalid in the specs, so there would be an error detected. But '\0' characters are valid UTF-8, so the only reason to disallow them could be la
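A minimal sketch of the two claims in this exchange, assuming Python's built-in UTF-8 codec as a stand-in for whatever decoder the language would ship. The helper name `decode_strict` is hypothetical. It shows that a NUL byte decodes as an ordinary character, while a malformed sequence is reported as an error instead of being passed through.

```python
def decode_strict(data: bytes) -> str:
    """Decode UTF-8, raising UnicodeDecodeError on any malformed sequence."""
    return data.decode("utf-8", errors="strict")


# U+0000 is a valid code point, and its UTF-8 form is the single byte 0x00.
text = decode_strict(b"a\x00b")

# 0xC0 0x80 is an overlong encoding of NUL; strict UTF-8 decoders reject it.
try:
    decode_strict(b"a\xc0\x80b")
except UnicodeDecodeError:
    rejected = True
else:
    rejected = False
```

Whether a language built on C libraries can afford embedded NULs is a separate question; the sketch only covers what the encoding itself permits.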

Re: Strings in a programming language

2003-07-07 Thread Wu Yongwei
Jungshik Shin wrote: > > It's unnecessary to handle ALL cases. You could address only issues > > encountered/expected by your end users. IMHO, it is more important > > to make an application be light-weight and run in 99% cases. Or, you > > may find your language used by, say, 1 people, and

Re: Strings in a programming language

2003-07-06 Thread Jungshik Shin
On Mon, 7 Jul 2003, Wu Yongwei wrote: > > > I wonder, how many people really want to use Unicode codepoints > beyond > > > U+? > > > > I don't want to make it incorrect by design just because cases it > doesn't > > handle are rare. > > It's unnecessary to handle ALL cases. You could address o

Re: Strings in a programming language

2003-07-06 Thread Wu Yongwei
> > I wonder, how many people really want to use Unicode codepoints beyond > > U+? > > I don't want to make it incorrect by design just because cases it doesn't > handle are rare. It's unnecessary to handle ALL cases. You could address only issues encountered/expected by your end users. IMHO

Re: Strings in a programming language

2003-07-06 Thread Wu Yongwei
Hi, Bruno, it was an FYI. Is providing some extra information (since Marcin mentioned he wanted his language to be portable to Windows) wrong? I was not arguing which format string was more reasonable, or standards-conformant. Just some information in case one wants one's application portable to

Re: Strings in a programming language

2003-07-04 Thread Marcin 'Qrczak' Kowalczyk
On Sat, 5 July 2003 02:56, srintuar26 wrote: > Old-fashioned character indexing is a dead-end road. Don't even bother > with it. Why use Unicode at all if you plan to write code that's > guaranteed to break for other languages? > > Bite the bullet and deal with text as sequences, you'll have to

Re: Strings in a programming language

2003-07-04 Thread Henry Spencer
On Fri, 4 Jul 2003, Marcin 'Qrczak' Kowalczyk wrote: > > > ...With UTF-8 it's much worse... > > Not actually by much, unless you're lucky enough to be able to decide that > > a following accent never alters which class a character falls in. (What > > if the user wants to write a "has no accents" p

Re: Strings in a programming language

2003-07-04 Thread Marcin 'Qrczak' Kowalczyk
On Fri, 4 July 2003 19:44, Henry Spencer wrote: > > with UTF-32 you can write a predicate for "allowed as the first character > > in identifier" and "allowed in the rest of identifier", and take as many > > characters as satisfy the predicate. With UTF-8 it's much worse... > > Not actually by
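The identifier-scanning argument can be sketched roughly as follows. This is an illustration only: the predicates `is_ident_start` and `is_ident_rest` are hypothetical stand-ins (letters, digits, underscore), and the UTF-8 variant assumes its input is already valid UTF-8. The first scanner indexes code points directly, as a UTF-32 string would allow; the second has to work out each sequence length and decode it before the predicate can be applied.

```python
def is_ident_start(ch: str) -> bool:
    # Hypothetical rule: a letter or underscore may start an identifier.
    return ch == "_" or ch.isalpha()


def is_ident_rest(ch: str) -> bool:
    # Hypothetical rule: letters, digits and underscore may continue it.
    return ch == "_" or ch.isalnum()


def scan_ident_codepoints(s: str, pos: int = 0) -> int:
    """Return the end index of the identifier at s[pos], or pos if there is none."""
    if pos >= len(s) or not is_ident_start(s[pos]):
        return pos
    end = pos + 1
    while end < len(s) and is_ident_rest(s[end]):
        end += 1
    return end


def utf8_len(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte (assumes valid input)."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def scan_ident_utf8(b: bytes, pos: int = 0) -> int:
    """Same scan over UTF-8 bytes: each character is decoded before testing."""
    end = pos
    first = True
    while end < len(b):
        n = utf8_len(b[end])
        ch = b[end:end + n].decode("utf-8")
        if not (is_ident_start(ch) if first else is_ident_rest(ch)):
            break
        end += n
        first = False
    return end


source = "zażółć9 = 1"
end_cp = scan_ident_codepoints(source)
end_u8 = scan_ident_utf8(source.encode("utf-8"))
```

Both scanners find the same identifier; they differ in the unit of the returned offset (code points versus bytes) and in the extra decoding step, which is the cost being debated here.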

Re: Strings in a programming language

2003-07-04 Thread Henry Spencer
On Thu, 3 Jul 2003, Marcin 'Qrczak' Kowalczyk wrote: > - Character predicates must work on strings and it's not obvious what part of > the string to feed to them. For example in a compiler/interpreter with > UTF-32 you can write a predicate for "allowed as the first character in > identifier"

Re: Strings in a programming language

2003-07-04 Thread Bruno Haible
Wu Yongwei wrote: > FYI, what MSDN says about the types in printf/wprintf: This is not only irrelevant, it is also wrong. "%S" is less portable than "%ls". The standard directive for wchar_t* strings is "%ls". And your cited description of "%s" has two errors. Read the Linux wprintf(3) manual pag

Re: Strings in a programming language

2003-07-04 Thread Marcin 'Qrczak' Kowalczyk
On Thu, 3 July 2003 20:48, Bruno Haible wrote: > You find details here: [...] Thanks! This is what I was looking for. > Since strings are immutable in your language, you can also represent > strings as UCS-2 or ISO-8859-1 if possible; this saves 75% of the memory > in many cases, at the cos
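A rough sketch of the compact-representation idea quoted above, assuming immutable strings whose widest code point is known at construction time. The function names `narrowest_encoding` and `pack` are hypothetical, and the sketch assumes the string contains no lone surrogates. Python's `latin-1` codec stands in for ISO-8859-1 and `utf-16-le` for UCS-2, which coincide for code points up to U+FFFF.

```python
def narrowest_encoding(s: str) -> str:
    """Pick the narrowest fixed-width form that can hold every code point in s."""
    top = max(map(ord, s), default=0)
    if top <= 0xFF:
        return "latin-1"      # 1 byte per character (ISO-8859-1)
    if top <= 0xFFFF:
        return "utf-16-le"    # 2 bytes per character (UCS-2 range only)
    return "utf-32-le"        # 4 bytes per character


def pack(s: str) -> bytes:
    """Store s in its narrowest fixed-width form (assumes no lone surrogates)."""
    return s.encode(narrowest_encoding(s))


ascii_packed = pack("hello")
ascii_wide = "hello".encode("utf-32-le")
```

For text that fits in ISO-8859-1 this uses one byte per character instead of four, which is where the 75% figure comes from. Indexing stays constant-time because each stored string is still fixed-width.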

Re: Strings in a programming language

2003-07-03 Thread Marcin 'Qrczak' Kowalczyk
On Fri, 4 July 2003 04:25, Wu Yongwei wrote: > I wonder, how many people really want to use Unicode codepoints beyond > U+? I don't want to make it incorrect by design just because cases it doesn't handle are rare. > However, UTF-16 is harder to process than UTF-8 (I think you are wron

Re: Strings in a programming language

2003-07-03 Thread Wu Yongwei
FYI, what MSDN says about the types in printf/wprintf: %s String When used with printf functions, specifies a single-byte–character string; when used with wprintf functions, specifies a wide-character string. %S String When used with printf functions, specifies a wide-character string; whe

Re: Strings in a programming language

2003-07-03 Thread Wu Yongwei
Marcin 'Qrczak' Kowalczyk wrote: I'm pretty sure that all Windows versions since Win95 allow to convert between UTF-16 and the encoding used for filenames, and WinNT allows to use UTF-16 filenames directly. So for filenames and the like, texts on screen, texts exchanged with databases etc. any Unic

Re: Strings in a programming language

2003-07-03 Thread Beni Cherniavsky
Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03: > On Thu, 3 July 2003 20:13, Beni Cherniavsky wrote: > > > > What kind of string processing UTF-8 makes simpler than UTF-32? > > > > Any processing that works now on ASCII and doesn't break UTF-8 > > sequences in the middle. > > It's not simpler,

Re: Strings in a programming language

2003-07-03 Thread Marcin 'Qrczak' Kowalczyk
On Thu, 3 July 2003 20:13, Beni Cherniavsky wrote: > > What kind of string processing UTF-8 makes simpler than UTF-32? > > Any processing that works now on ASCII and doesn't break UTF-8 > sequences in the middle. It's not simpler, it's the same. Since the language is not C, it doesn't matte
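A small sketch of the kind of processing meant by "works now on ASCII and doesn't break UTF-8 sequences": splitting on an ASCII delimiter at the byte level. It relies on a property of UTF-8, namely that bytes below 0x80 never occur inside a multi-byte sequence. The helper name `split_ascii_space` is hypothetical.

```python
def split_ascii_space(data: bytes):
    """Split UTF-8 bytes on the ASCII space without decoding anything."""
    return data.split(b" ")


raw = "zażółć gęślą jaźń".encode("utf-8")
pieces = split_ascii_space(raw)

# Every piece is still valid UTF-8, so decoding afterwards cannot fail here.
words = [p.decode("utf-8") for p in pieces]
```

This only covers delimiters that are themselves ASCII. A non-ASCII space such as U+00A0 would go unnoticed, which is the "arbitrary character predicate" case raised elsewhere in the thread.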

Re: Strings in a programming language

2003-07-03 Thread Bruno Haible
Hi Marcin, > Most languages take 3, as I understand Perl it takes the mix of 3 and 2, > and Python has both 3 and 1. I think I will take 1, but I need advice: - Don't look at Perl in this case - Perl has the handicap that for historical reasons it cannot make a clear distinction between byte arr

Re: Strings in a programming language

2003-07-03 Thread Beni Cherniavsky
Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03: > On Thu, 3 July 2003 19:02, srintuar26 wrote: > > > Well, for C++, white space are ' ', '\r', '\n', '\t'; it's totally trivial. > > Replace "whitespace" with "an arbitrary character predicate", e.g. for finding > the end of an identifier. > > > I

Re: Strings in a programming language

2003-07-03 Thread Marcin 'Qrczak' Kowalczyk
On Thu, 3 July 2003 19:05, Beni Cherniavsky wrote: > > I'm more afraid of requiring headaches from other people trying to > > interface to C libraries. > > Please elaborate - interface from what to which C libraries? Do you > mean here people writing C extensions to your language? Yes. Of c

Re: Strings in a programming language

2003-07-03 Thread Marcin 'Qrczak' Kowalczyk
On Thu, 3 July 2003 19:02, srintuar26 wrote: > Well, for C++, white space are ' ', '\r', '\n', '\t'; it's totally trivial. Replace "whitespace" with "an arbitrary character predicate", e.g. for finding the end of an identifier. > If you want to iterate over a string, don't use single codepoi

Re: Strings in a programming language

2003-07-03 Thread Marcin 'Qrczak' Kowalczyk
On Thu, 3 July 2003 18:50, Beni Cherniavsky wrote: > Such programs would break anyway, with non-spacing marks and wide > characters. But they would at least work with ISO-8859-x files. Not all programs are expected to be used with all possible encodings, and using UTF-32 internally would n

Re: Strings in a programming language

2003-07-03 Thread Beni Cherniavsky
Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03: > On Thu, 3 July 2003 18:17, Beni Cherniavsky wrote: > > > How much custom runtime libraries are you ready to write and use > > (vs. compiling to calls to standard libraries)? > > I'm prepared to interface to iconv and such, and to different I/O

Re: Strings in a programming language

2003-07-03 Thread srintuar26
> I will consider this. Here are disadvantages: > > With UTF-8 it's much worse, you must in > essence decode UTF-8 on the fly. You can't even implement "split on > whitespace" without UTF-8 decoding, because you don't know what part of the > string to test whether it's a whitespace chara

Re: Strings in a programming language

2003-07-03 Thread Marcin 'Qrczak' Kowalczyk
On Thu, 3 July 2003 18:17, Beni Cherniavsky wrote: > How optimized should it be? Like Lisp or Smalltalk, i.e. quite optimized as for a dynamically typed language, but not as much as usually are statically typed languages. > How much custom runtime libraries are you ready to write and use >

Re: Strings in a programming language

2003-07-03 Thread Beni Cherniavsky
Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03: > - Simple one-time-use programs which assume that characters are what you get > when you index strings (which break paragraphs or draw ASCII tables or count > occurrences of characters) are broken more often. They work only for ASCII, > where w

Re: Strings in a programming language

2003-07-03 Thread Beni Cherniavsky
srintuar26 wrote on 2003-07-03: > Another suggestion for your language: keep the source code in the same > encoding as your strings. If you want a UTF-32 language, then require > all source code to be encoded in that same encoding as well. Allow > identifiers, comments, and literals to all be in t

Re: Strings in a programming language

2003-07-03 Thread Marcin 'Qrczak' Kowalczyk
On Thu, 3 July 2003 17:52, srintuar26 wrote: > It's thousands of times easier to use UTF-8, I will consider this. Here are disadvantages: - Character predicates must work on strings and it's not obvious what part of the string to feed to them. For example in a compiler/interpreter with U

Re: Strings in a programming language

2003-07-03 Thread Beni Cherniavsky
Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03: > I'm designing and implementing a programming language for fun. I > must decide how strings should be handled. > > The language is most similar to Dylan, but let's assume its purpose > will be like Python's. It will have a compiler which produces C c

Re: Strings in a programming language

2003-07-03 Thread srintuar26
> - Are there other reasonable ways? > - Is it good to use UTF-32 on Unix and UTF-16 on Windows? I'd advise against it. There is really no advantage to using UTF-32 anymore. It won't tell you where it's safe to break strings, you can still have invalid sequences, some characters will still require
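One way to illustrate the "safe to break" point, as a sketch only: even with one fixed-width unit per code point, a user-visible character can span several code points when combining marks are involved. The helper `break_is_safe` is hypothetical and checks only for combining marks using Python's `unicodedata` module; full grapheme segmentation is defined by Unicode Standard Annex #29 and is not implemented here.

```python
import unicodedata


def break_is_safe(s: str, i: int) -> bool:
    """Rough check: cutting before index i must not detach a combining mark."""
    if i <= 0 or i >= len(s):
        return True
    return unicodedata.combining(s[i]) == 0


accented = "e\u0301"                      # 'e' followed by COMBINING ACUTE ACCENT
units = len(accented.encode("utf-32-le")) // 4

cut_inside = break_is_safe(accented, 1)   # would separate the accent from 'e'
cut_plain = break_is_safe("ab", 1)
```

The string occupies two UTF-32 units although it displays as one character, so indexing by code point does not by itself give character boundaries.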

Strings in a programming language

2003-07-03 Thread Marcin 'Qrczak' Kowalczyk
I'm designing and implementing a programming language for fun. I must decide how strings should be handled. The language is most similar to Dylan, but let's assume its purpose will be like Python's. It will have a compiler which produces C code and should work at least on Linux and Windows. Wh