Re: encoding vs charset

2008-07-17 Thread NotFound
On Thu, Jul 17, 2008 at 5:20 AM, Allison Randal [EMAIL PROTECTED] wrote:

 The thing is, there's a tendency for data for a particular program or
 application to all be from the same character set (if, for example, you're
 parsing a series of files, munging the data in some way, and writing out a
 series of files as a result). We never want to force all data to be
 transformed into one canonical character set, because it significantly

But that is not my proposal. The proposal is to consider that all
texts are already Unicode, just encoded in their own particular way.
And there is no need to transform them unless asked, the same way that
utf8 does not need to be converted to utf16 or utf32 if not asked.
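
A rough sketch of the idea in Python (purely illustrative, nothing to
do with Parrot's actual string internals): the bytes stay in whatever
encoding they arrived in, and a conversion happens only when another
encoding is explicitly requested.

    class Text:
        def __init__(self, data: bytes, encoding: str):
            self.data = data          # bytes kept exactly as they arrived
            self.encoding = encoding  # e.g. "utf-8", "iso-8859-1", "ascii"

        def codepoints(self) -> str:
            # Every encoding is read as a way of spelling Unicode codepoints.
            return self.data.decode(self.encoding)

        def recode(self, encoding: str) -> "Text":
            # The only place any transformation happens, and only when asked.
            return Text(self.codepoints().encode(encoding), encoding)

    t = Text(b"\xabhola\xbb", "iso-8859-1")
    print(t.codepoints())          # «hola»
    print(t.recode("utf-8").data)  # b'\xc2\xabhola\xc2\xbb'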

But I had better leave this discussion, and not reopen it without
first preparing a detailed proposal.

-- 
Salu2


Re: encoding vs charset

2008-07-16 Thread NotFound
On Wed, Jul 16, 2008 at 1:13 AM, Moritz Lenz
[EMAIL PROTECTED] wrote:
 NotFound wrote:
 * Unicode isn't necessarily universal, or might stop being so in the
 future. If a character is not representable in Unicode, and you chose
 to use Unicode for everything, you're screwed.
 There is provision for private-use codepoints.
 If we use them in parrot, we can't use them in HLLs, right? Do we
 really want that?

I don't understand that point. An HLL can use any codepoint it wants,
no matter whether there is a glyph for it in any available font. The
way of writing it in the source is not important to parrot; you just
need to emit valid pir, create a valid pbc, or whatever.

 * Related to the previous point, some other character encodings might
 not have a lossless round-trip conversion.
 Do we need that? The intention is that strings are stored in whatever
 format is wanted and not recoded without a good reason.
 But if you can't work with non-Unicode text strings, you have to convert
 them, and in the process you possibly lose information. That's why we
 want to enable text strings with non-Unicode semantics.

But the point is precisely that we don't need to treat any text as non-Unicode.

 Introducing the "no character set" character set is just a special
 case of arbitrary character sets. I see no point in using the special
 case over the generic one.
 Because it is special, and we need to deal with its speciality in any
 case. Just concatenating it with any other string is plain wrong.
 Just treating it as iso-8859-1 is not taking it as plain binary at all.
 Just as it is plain wrong to concatenate strings in two
 non-compatible character sets (unless you store the strings as trees,

Yes, and because of that the approach of considering Unicode the only
character set is simpler. That way, concatenating any pair of text
strings as text has no other problem than deciding the destination
encoding.
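
A minimal illustration of that in Python (not Parrot code): both
operands are read as codepoints, so the only decision left is which
encoding the result should be stored in.

    left = b"\xab"       # "«" arriving as iso-8859-1
    right = b"\xc2\xbb"  # "»" arriving as utf-8
    # Same codepoint space, so concatenating as text is straightforward ...
    text = left.decode("iso-8859-1") + right.decode("utf-8")
    # ... and the only open question is the destination encoding.
    print(text.encode("utf-8"))       # b'\xc2\xab\xc2\xbb'
    print(text.encode("iso-8859-1"))  # b'\xab\xbb'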

 But the main point is that the encoding issue is complicated enough
 even inside Unicode, and adding another layer of complexity will make
 it worse.
 I think that distinguishing incompatible character sets is no harder
 than distinguishing text and binary strings. It's not another layer,
 it's just a layer used in a more general way.

And what will that way be? In the current implementation we have the
ascii, iso-8859-1 and unicode charsets (not counting binary). Add
another charset, and we need conversions to/from all of these. Add yet
another, and the number of conversions adds up and multiplies.

With the Unicode-plus-encodings approach, adding any charset of 8 bits
or fewer, taken as a Unicode encoding, is just adding a table of its
256 corresponding codepoints.
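
For illustration only, a quick Python sketch of what such a table
amounts to (cp437 stands in here for an arbitrary 8-bit charset; this
is not Parrot code):

    # The whole cost of adding the charset: 256 byte-value -> codepoint entries.
    table = [ord(bytes([i]).decode("cp437")) for i in range(256)]

    def decode_8bit(data: bytes) -> str:
        # Decoding is then one table lookup per byte.
        return "".join(chr(table[b]) for b in data)

    print(decode_8bit(b"\x9b"))  # cp437 byte 0x9B -> U+00A2, the cent sign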

-- 
Salu2


Re: encoding vs charset

2008-07-16 Thread Allison Randal

Moritz Lenz wrote:

NotFound wrote:

To open another can of worms, I think that we can live without
character set specification. We can establish that the character set
is always Unicode, and deal only with encodings.


We had that discussion already, and the answer was no, for several reasons:
* Strings might contain binary data; it doesn't make sense to view them
as Unicode.
* Unicode isn't necessarily universal, or might stop being so in the
future. If a character is not representable in Unicode, and you chose
to use Unicode for everything, you're screwed.
* Related to the previous point, some other character encodings might
not have a lossless round-trip conversion.


Yes, we can never assume Unicode as the character set, or restrict 
Parrot to only handling the Unicode character set.



Ascii is an encoding
that maps directly to codepoints and only allows the values 0-127.
iso-8859-1 is the same with the 0-255 range. Any other 8-bit encoding
just needs a translation table. The only point to solve is that we
need some special way to work with fixed-8 data that has no intended
character representation.


Introducing the "no character set" character set is just a special case
of arbitrary character sets. I see no point in using the special case
over the generic one.


The thing is, there's a tendency for data for a particular program or 
application to all be from the same character set (if, for example, 
you're parsing a series of files, munging the data in some way, and 
writing out a series of files as a result). We never want to force all 
data to be transformed into one canonical character set, because it 
significantly increases the cost of working with data from different 
character sets, and the chances of corrupting that data in the process. 
If someone is reading, modifying, and writing EBCDIC files, they 
shouldn't have to translate their data to an intermediate format and 
back again.


Allison


encoding vs charset

2008-07-15 Thread Leopold Toetsch
Hi,

I just saw that and such (too late) at #parrotsketch:

  21:52  NotFound So unicode:\xab and utf8:unicode:\xab is also the same 
result?

In my opinion (and AFAIK still in the implementation) the encoding bit
of PIR is how the possibly escaped bytes specify the codepoint in the
_source code_. That codepoint will then belong to some charset. Alas,
the above example is illegal.

The source encoding of the mentioned file t/op/stringu.t is utf8:

:set fenc?
  fileencoding=utf-8

pasm_output_is( <<'CODE', <<OUTPUT, "UTF8 literals" );
set S0, utf8:unicode:"«"

and ...

pasm_output_is( <<'CODE', <<OUTPUT, "UTF8 literals" );
set S0, utf8:unicode:"\xc2\xab"

this is valid UTF8 encoding too, as there is no collision between escaped and 
non-escaped UTF8 chars.

unicode:\ab is illegal, as there is no encoding in unicode that would
make this a codepoint (all the more so as the default encoding of the
unicode charset is utf8). Or, IOW, if this were valid then the escaped
char syntax would be ambiguous.

21:51  pmichaud so   unicode:«   and unicode:\xab  would produce exactly 
the same result.
21:51  pmichaud even down to being the same .pbc output.
21:51  allison pmichaud: exactly

The former is a valid char in an UTF8/iso-8859-1 encoded source file and only 
there, while the latter is a single invalid UTF8 char part. How would you 
interpret unicode:\xab\x65 then?
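
For the byte-level view, in Python terms (illustration only, not a
statement about PIR syntax):

    print(b"\xc2\xab".decode("utf-8"))   # «: the byte pair is one codepoint, U+00AB
    try:
        b"\xab".decode("utf-8")          # a lone 0xAB is not valid UTF-8 ...
    except UnicodeDecodeError as err:
        print("invalid UTF-8:", err)
    print(b"\xab".decode("iso-8859-1"))  # ... though as iso-8859-1 it is U+00AB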

I think that there is still some confusion between the encoding of the
source code (and the desired meaning in the charset) and the internal
encoding of parrot, which might be UCS2 or anything.

my 2 ¢
leo


Re: encoding vs charset

2008-07-15 Thread Patrick R. Michaud
On Tue, Jul 15, 2008 at 11:17:23PM +0200, Leopold Toetsch wrote:
 21:51  pmichaud so   unicode:«   and unicode:\xab  would produce 
 exactly 
 the same result.
 21:51  pmichaud even down to being the same .pbc output.
 21:51  allison pmichaud: exactly
 
 The former is a valid char in an UTF8/iso-8859-1 encoded source file and only 
 there, while the latter is a single invalid UTF8 char part. How would you 
 interpret unicode:\xab\x65 then?

I'd want \xab and \x65 to represent two codepoints, not encoding bytes
for a single codepoint.

Pm


Re: encoding vs charset

2008-07-15 Thread Mark J. Reed
 unicode:\ab is illegal

No way.  Unicode \ab should represent U+00AB.  I don't care what
the byte-level representation is.  In UTF-8, that's 0xc2 0xab; in
UTF-16BE it's 0x00 0xab; in UTF-32LE it's 0xab 0x00 0x00 0x00.
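
(Python, just to spell those bytes out; illustration only:)

    ch = "\u00ab"                  # «
    print(ch.encode("utf-8"))      # b'\xc2\xab'
    print(ch.encode("utf-16-be"))  # b'\x00\xab'
    print(ch.encode("utf-32-le"))  # b'\xab\x00\x00\x00'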

 I think that there is still some confusion between the encoding of the
 source code (and the desired meaning in the charset) and the internal
 encoding of parrot, which might be UCS2 or anything.

IMESHO, the encoding of the source code should have no bearing on the
interpretation of string literal escape sequences within that source
code.  \ab should mean U+00AB no matter whether the surrounding
source code is UTF-8, ISO-8859-1, Big-5, whatever; if the source
language wants to work differently, it's up to its parser to convert.

-- 
Mark J. Reed [EMAIL PROTECTED]


Re: encoding vs charset

2008-07-15 Thread Leopold Toetsch
On Tuesday, 15 July 2008 at 23:35, Patrick R. Michaud wrote:
 On Tue, Jul 15, 2008 at 11:17:23PM +0200, Leopold Toetsch wrote:
  21:51  pmichaud so   unicode:«   and unicode:\xab  would produce
  exactly the same result.
  21:51  pmichaud even down to being the same .pbc output.
  21:51  allison pmichaud: exactly
 
  The former is a valid char in an UTF8/iso-8859-1 encoded source file and
  only there, while the latter is a single invalid UTF8 char part. How
  would you interpret unicode:\xab\x65 then?

 I'd want \xab and \x65 to represent two codepoints, not encoding bytes
 for a single codepoint.

And that shall be distinguished from:

U+AB65: ꭥ  

by what?

 Pm

leo


Re: encoding vs charset

2008-07-15 Thread NotFound
On Tue, Jul 15, 2008 at 11:45 PM, Mark J. Reed [EMAIL PROTECTED] wrote:

 IMESHO, the encoding of the source code should have no bearing on the
 interpretation of string literal escape sequences within that source
 code.  \ab should mean U+00AB no matter whether the surrounding
 source code is UTF-8, ISO-8859-1, Big-5, whatever; if the source
 language wants to work differently, it's up to its parser to convert.

The HLL source must not be relevant here; if we reach a clear spec, it
will be plain easy for HLL writers to generate the PIR that gives the
result they want, and to use for their sources whatever rules their
languages impose or allow.

I think that "escapes are always codepoints" is the clean and
consistent approach.

-- 
Salu2


Re: encoding vs charset

2008-07-15 Thread Mark J. Reed
Uhm, by the fact that they didn't type \ab65 ?
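
In Python terms, for illustration (PIR's escape syntax may of course
differ):

    two = "\xab\x65"  # two escapes -> two codepoints: U+00AB «, then U+0065 e
    one = "\uab65"    # one escape  -> the single codepoint U+AB65
    print(len(two), len(one))          # 2 1
    print([hex(ord(c)) for c in two])  # ['0xab', '0x65']
    print(hex(ord(one)))               # '0xab65'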




On 7/15/08, Leopold Toetsch [EMAIL PROTECTED] wrote:
 On Tuesday, 15 July 2008 at 23:35, Patrick R. Michaud wrote:
 On Tue, Jul 15, 2008 at 11:17:23PM +0200, Leopold Toetsch wrote:
  21:51  pmichaud so   unicode:«   and unicode:\xab  would produce
  exactly the same result.
  21:51  pmichaud even down to being the same .pbc output.
  21:51  allison pmichaud: exactly
 
  The former is a valid char in an UTF8/iso-8859-1 encoded source file and
  only there, while the latter is a single invalid UTF8 char part. How
  would you interpret unicode:\xab\x65 then?

 I'd want \xab and \x65 to represent two codepoints, not encoding bytes
 for a single codepoint.

 And that shall be distinguished from:

 U+AB65: ꭥ

 by what?

 Pm

 leo


-- 
Sent from Gmail for mobile | mobile.google.com

Mark J. Reed [EMAIL PROTECTED]


Re: encoding vs charset

2008-07-15 Thread NotFound
To open another can of worms, I think that we can live without
character set specification. We can establish that the character set
is always Unicode, and deal only with encodings. Ascii is an encoding
that maps directly to codepoints and only allows the values 0-127.
iso-8859-1 is the same with the 0-255 range. Any other 8-bit encoding
just needs a translation table. The only point to solve is that we
need some special way to work with fixed-8 data that has no intended
character representation.
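
To illustrate the first part in Python (not Parrot code): for ascii and
iso-8859-1 the byte value already is the codepoint, so no table is
needed at all.

    for raw, enc in [(b"A", "ascii"), (b"\xab", "iso-8859-1")]:
        ch = raw.decode(enc)
        print(enc, hex(raw[0]), "-> U+%04X" % ord(ch))
    # ascii 0x41 -> U+0041
    # iso-8859-1 0xab -> U+00AB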

-- 
Salu2


Re: encoding vs charset

2008-07-15 Thread Moritz Lenz
NotFound wrote:
 To open another can of worms, I think that we can live without
 character set specification. We can establish that the character set
 is always Unicode, and deal only with encodings.

We had that discussion already, and the answer was no, for several reasons:
* Strings might contain binary data; it doesn't make sense to view them
as Unicode.
* Unicode isn't necessarily universal, or might stop being so in the
future. If a character is not representable in Unicode, and you chose
to use Unicode for everything, you're screwed.
* Related to the previous point, some other character encodings might
not have a lossless round-trip conversion.

 Ascii is an encoding
 that maps directly to codepoints and only allows the values 0-127.
 iso-8859-1 is the same with the 0-255 range. Any other 8-bit encoding
 just needs a translation table. The only point to solve is that we
 need some special way to work with fixed-8 data that has no intended
 character representation.

Introducing the "no character set" character set is just a special case
of arbitrary character sets. I see no point in using the special case
over the generic one.

Here's the discussion we had on this subject:
http://irclog.perlgeek.de/parrot/2008-06-23#i_362697

Cheers,
Moritz

-- 
Moritz Lenz
http://moritz.faui2k3.org/ |  http://perl-6.de/


Re: encoding vs charset

2008-07-15 Thread NotFound
 * Unicode isn't necessarily universal, or might stop being so in the
 future. If a character is not representable in Unicode, and you chose
 to use Unicode for everything, you're screwed.

There is provision for private-use codepoints.

 * Related to the previous point, some other character encodings might
 not have a lossless round-trip conversion.

Do we need that? The intention is that strings are stored in whatever
format is wanted and not recoded without a good reason.

 needs a translation table. The only point to solve is that we need
 some special way to work with fixed-8 data that has no intended
 character representation.
 Introducing the "no character set" character set is just a special case
 of arbitrary character sets. I see no point in using the special case
 over the generic one.

Because it is special, and we need to deal with its speciality in any
case. Just concatenating it with any other string is plain wrong.
Just treating it as iso-8859-1 is not taking it as plain binary at all.

But the main point is that the encoding issue is complicated enough
even inside Unicode, and adding another layer of complexity will make
it worse.

-- 
Salu2


Re: encoding vs charset

2008-07-15 Thread Moritz Lenz
NotFound wrote:
 * Unicode isn't necessarily universal, or might stop being so in the
 future. If a character is not representable in Unicode, and you chose
 to use Unicode for everything, you're screwed.
 
 There is provision for private-use codepoints.

If we use them in parrot, we can't use them in HLLs, right? Do we
really want that?

 * Related to the previous point, some other character encodings might
 not have a lossless round-trip conversion.
 
 Do we need that? The intention is that strings are stored in whatever
 format is wanted and not recoded without a good reason.

But if you can't work with non-Unicode text strings, you have to convert
them, and in the process you possibly lose information. That's why we
want to enable text strings with non-Unicode semantics.
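
One easy-to-demonstrate direction of that hazard, as a Python
illustration (nothing Parrot-specific): text forced through a charset
that cannot represent it comes back damaged.

    s = "Œuvre: 10 €"  # U+0152 and U+20AC have no iso-8859-1 codepoint
    damaged = s.encode("iso-8859-1", errors="replace").decode("iso-8859-1")
    print(damaged)     # ?uvre: 10 ?  (the round trip lost information)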

 needs a translation table. The only point to solve is that we need
 some special way to work with fixed-8 data that has no intended
 character representation.
 Introducing the "no character set" character set is just a special case
 of arbitrary character sets. I see no point in using the special case
 over the generic one.
 
 Because it is special, and we need to deal with its speciality in any
 case. Just concatenating it with any other string is plain wrong.
 Just treating it as iso-8859-1 is not taking it as plain binary at all.

Just as it is plain wrong to concatenate strings in two
non-compatible character sets (unless you store the strings as trees,
and have each substring carry both its encoding and charset information.
But then you still can't compare them, for example).

 But the main point is that the encoding issue is complicated enough
 even inside Unicode, and adding another layer of complexity will make
 it worse.

I think that distinguishing incompatible character sets is no harder
than distinguishing text and binary strings. It's not another layer,
it's just a layer used in a more general way.

Moritz

-- 
Moritz Lenz
http://moritz.faui2k3.org/ |  http://perl-6.de/