Re: Cyrrilic support is broken?

Yves Jaradin Tue, 18 Aug 2009 06:15:16 -0700

Boriss Mejias wrote :

Hi all,

Characters from other Western languages are not supported either. I hadn't noticed it until now even though I'm a native Spanish speaker, but I'm so used to lacking support that I didn't even try until now.


Anyway... Here are the tests I just made

{Browse "á"}  -> [195 161] two codes
{Browse "ñ"}  -> [195 177] two codes too, which is not the same as
{Browse "n~"} -> [110 126]

I assume 195 is a composition code

Then, the last two tests:

{Browse &a}  -> 97
{Browse &á}  -> error

%************************** lexical error ***********************
%**
%** illegal character
%**
%** in file "Oz", line 7, column 10
%** ------------------ rejected (1 error)

cheers
Boriss

Here is my understanding of it.

Basically, Oz doesn't really use iso8859-1 or any other encoding. From the emulator's point of view there are only bytes. For these bytes to be interpreted as text, a charset/encoding is needed. The following charsets/encodings are of interest:

1) The one of the console (for console oriented Oz programs)
2) The one that Tk uses (for Tk and QTk based Oz GUIs)
3) The one your source Oz file is written in

4) The one the C library will use (this one might depend on settings from the compile time of the virtual machine and on settings at its runtime)

The only real constraint is that 3) must be ASCII-compatible. This means that (most) byte values between 0 and 127 must have ASCII semantics (e.g. 65=A). If not, you might have difficulties writing keywords that the compiler will recognize! It also means that bytes with values between 0 and 127 should never be interpreted differently, even inside a special sequence or when preceded by some "shifting" character, since the VM interprets files as simple streams of bytes and will recognize an Oz keyword even in a supposedly "shifted" state.

In practice the following charsets/encodings should be usable for Oz source code:

all ISO 8859-x (including ISO 8859-5)
all KOI8 encodings (including KOI8-R and KOI8-U)
Most DOS and Windows codepages (including CP850 and CP1251)
UTF8
EUC-JP

These are definitely unusable:
UTF7 (at least it would be extremely difficult to use)
UTF16 (lots of embedded 0 bytes)
UTF32 (idem)
Most (all?) versions of EBCDIC (not ASCII-compatible: you would need to write @ for | and even letters would need transpositions!)
ISO2022 (you could have some Japanese text being interpreted as a closing quote, a keyword, etc.)
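
The two lists above can be spot-checked mechanically. The following is a minimal Python sketch, not part of Mozart: the function name `looks_usable_for_oz`, the sample texts and the Python codec names are assumptions made for illustration. It tests the two properties described earlier: ASCII text keeps its ASCII byte values, and non-ASCII characters never produce bytes below 128.

```python
# Sketch: which encodings keep Oz source "byte-transparent"?
# An ASCII-only fragment of Oz source used to test ASCII compatibility.
OZ_SAMPLE = 'functor define {Browse "x"} X|Xs &a end'

# Python codec name -> non-ASCII sample text that the codec can encode.
SAMPLES = {
    "iso8859_1": "áñ",
    "iso8859_5": "АБВГД",
    "koi8_r": "АБВГД",
    "koi8_u": "АБВГД",
    "cp850": "áñ",
    "cp1251": "АБВГД",
    "utf_8": "áñАБВГД日本",
    "euc_jp": "日本",
    "utf_16_le": "áñ",
    "utf_32_le": "áñ",
    "cp037": "áñ",          # one EBCDIC variant
    "iso2022_jp": "日本",
}

def looks_usable_for_oz(codec, sample):
    """True if ASCII stays ASCII and non-ASCII text yields only bytes >= 128."""
    ascii_ok = OZ_SAMPLE.encode(codec) == OZ_SAMPLE.encode("ascii")
    high_only = all(b >= 128 for b in sample.encode(codec))
    return ascii_ok and high_only

for codec, sample in SAMPLES.items():
    verdict = "usable" if looks_usable_for_oz(codec, sample) else "unusable"
    print(f"{codec:12} {verdict}")
```

A check like this only samples a few characters, so a "usable" verdict is an indication rather than a proof.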

Of course there are still some restrictions. Since the machine interprets files as streams of bytes, the semantics of the &x construct is the value of the byte following the byte with value 38 (& in ASCII). If this is part of a multibyte character, the first byte will be taken and the remaining ones are most likely to lead to a syntax error.
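
To make the byte-level reading of &x concrete, here is a small Python sketch. It is not the actual Oz lexer: `char_literal` is a hypothetical helper that ignores escape sequences and simply takes the one byte after & (byte 38 in ASCII), returning whatever bytes are left over.

```python
AMPERSAND = 38  # byte value of '&' in ASCII

def char_literal(source_bytes):
    """Return (value, leftover) for a source fragment starting with '&'.

    value    -- the single byte that follows '&'
    leftover -- any remaining bytes, which a lexer would have to parse next
    """
    if len(source_bytes) < 2 or source_bytes[0] != AMPERSAND:
        raise ValueError("expected '&' followed by at least one byte")
    return source_bytes[1], list(source_bytes[2:])

print(char_literal("&a".encode("utf_8")))      # (97, [])
print(char_literal("&á".encode("utf_8")))      # (195, [161])
print(char_literal("&á".encode("iso8859_1")))  # (225, [])
```

In the UTF-8 case the stray byte 161 is what would most likely trigger the "illegal character" error, while the same source saved in a single-byte encoding leaves nothing behind.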

Charset/encoding 4) determines the semantics of operations in the Char module (I think) and can otherwise be ignored.

If the charsets/encodings 1), 2) and 3) are not all the same, your application might need to do some explicit conversions.
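
As an illustration of such an explicit conversion, the Python sketch below re-encodes bytes from one Cyrillic code page to another. The specific pairing (source file saved as CP1251, console expecting CP866) is purely an assumption chosen for the example; the thread does not establish which code pages were actually involved, and the helper name `convert` is hypothetical.

```python
def convert(data, src, dst):
    """Re-encode bytes from charset `src` to charset `dst`."""
    return data.decode(src).encode(dst)

# Bytes as they would be stored in a source file saved as CP1251 (assumed).
source_bytes = "АБВГД".encode("cp1251")
print(list(source_bytes))               # [192, 193, 194, 195, 196]

# Without conversion, a CP866 console (assumed) would show these same bytes
# as box-drawing characters instead of letters.
garbled = source_bytes.decode("cp866")
print(garbled)                          # └┴┬├─

# With conversion, the console receives the bytes it expects.
console_bytes = convert(source_bytes, "cp1251", "cp866")
print(list(console_bytes))              # [128, 129, 130, 131, 132]
```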

In case you need to do explicit conversions, or need operations of the type provided by the Char module but for a charset/encoding other than 4), the easiest way will be to create a binding to some Unicode library such as libICU or to use one of the projects already mentioned.

Emacs should ask for the charset/encoding when saving a file which is not pure ASCII. On Linux, most recent distributions use UTF8 for the console.

You should be able to determine what all these charsets/encodings are by experimenting with strings made explicitly of integers (such as [72 105]="Hi" in ASCII) according to the potential charsets/encodings.
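
One way to prepare such experiments is to generate the integer lists outside Oz. The Python sketch below prints, for each candidate charset/encoding, an Oz list literal for a test string; the helper `oz_list` and the list of candidates are assumptions for illustration. Passing each list to System.showInfo and seeing which one renders correctly would suggest what the console expects.

```python
def oz_list(text, codec):
    """Format the bytes of `text` encoded with `codec` as an Oz list of integers."""
    return "[" + " ".join(str(b) for b in text.encode(codec)) + "]"

print(oz_list("Hi", "ascii"))  # [72 105]

# Candidate encodings for Cyrillic text (an illustrative, not exhaustive, list).
for codec in ("cp1251", "cp866", "koi8_r", "iso8859_5", "utf_8"):
    print(f"% {codec}")
    print("{System.showInfo " + oz_list("АБВГД", codec) + "}")
```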


Yves

Dmitry Negius wrote:

I did as you said on my Windows XP Service Pack 3 computer.
I wrote and compiled the following:

functor

import
 Application
 System
define
 {System.showInfo "АБВГД"}
 {Application.exit 0}
end

The output from this program is pseudo-graphic trash - not the Russian letters.

So the compiler or the interpreter or both have errors in their Cyrillic support.
This does not mean that the Mozart OPI is correct. The Mozart OPI also has a mistake,
because an Oz program written in the FAR Commander editor, with correctly
displayed Russian letters, is displayed wrongly in the Mozart OPI.

2009/8/17 Wolfgang Meyer wrote:


    Hi,

    actually, ASCII only defines the codes 0-127.
    Oz uses the ISO/IEC 8859-1 charset, which covers Western European
    languages.
    However, as long as you only use normal input and output and no GUI,
    it might still work with Cyrillic symbols on a computer which uses a
    Cyrillic codepage.

    To test whether the problem is with Emacs or with Oz, you could
    write a little program like

    functor
    import
     Application
     System
    define
     {System.showInfo "some Cyrillic text"}
     {Application.exit 0}
    end

    Compile it with "oz -c filename.oz" and execute it in a
    shell/DOS-Box with "ozengine filename.ozf".

    If this works, we know that the problem is either with Emacs or with
    the Oz-Emacs-interface.

    Cheers,
     Wolfgang

     > Cyrillic symbols are situated in the high part of the ASCII table and have
     > codes lower than 256.
     > An Oz program with Cyrillic symbols is ASCII text, but it is displayed
     > wrongly by the Emacs OPI.
     > The question is still open :-)
     >
     > 2009/8/17 Torsten Anders wrote:
     >
     > >  Dear Dmitry,
     > >
     > >   On 17 Aug 2009, at 14:25, Dmitry Negius wrote:
     > >
     > > Hello.
     > > I am studying Mozart - Oz now and found a problem with Cyrillic input in the
     > > Emacs OPI.
     > > None of the 3 Cyrillic input modes works in Emacs - input letters are
     > > displayed incorrectly.
     > >
     > > Is it a Mozart or an Emacs bug, and is there a workaround for this problem?

     > >
     > >

     > > As far as I know, Mozart source must be ASCII. Unicode support was
     > > discussed before (check the mailing list archive) but is not part of the
     > > language. Mozart extensions for Unicode are proposed by
     > >
     > >     * http://www.snowlion.nl/mozart/
     > >     * http://www.mozart-oz.org/mogul/info/fkonvick/unicode.html
     > >
     > > Hope I understood your question..
     > >
     > > Best
     > > Torsten
     > >

_________________________________________________________________________________
mozart-users mailing list                               
[email protected]
http://www.mozart-oz.org/mailman/listinfo/mozart-users
