Re: Terminology question re ASCII

2013-10-28 Thread David Starner
On Mon, Oct 28, 2013 at 10:38 PM, Mark Davis ☕  wrote:
> Normally the term ASCII just refers to the 7-bit form. What is sometimes
> called "8-bit ASCII" is the same as ISO Latin 1. If you want to be
> completely clear, you can say "7-bit ASCII".

One of the first hits for "8-bit ASCII" on Google Books is "When the
Mac came out. it supported 8-bit ASCII.", courtesy of "Introduction to
Digital Publishing", by David Bergsland. (He also seems to be under
the delusion that MS-DOS used 7-bit ASCII.) I don't think you can
assume anything about 8-bit ASCII besides the lower bits (hopefully)
being compatible with ASCII.

-- 
Kie ekzistas vivo, ekzistas espero.




Re: Terminology question re ASCII

2013-10-28 Thread Jukka K. Korpela

2013-10-29 6:12, d...@bisharat.net wrote:


> If one refers to "plain ASCII," or "plain ASCII text" or "...
> characters," should this be taken strictly as referring to the 7-bit
> basic characters, or might it encompass characters that might appear
> in an 8-bit character set (per the so-called "extended ASCII")?


In correct usage, “ASCII” refers to a specific standard, namely
“American National Standard for Information Systems - Coded Character
Sets - 7-Bit American National Standard Code for Information
Interchange (7-Bit ASCII)”, ANSI X3.4-1986, except in historical
presentations, where it might refer to predecessors of that standard
(earlier versions of ASCII).


In common usage, “ASCII” is also used to denote a) text data in general, 
b) some 8-bit encoding that has ASCII characters as its 7-bit subset, 
and c) other things. This can be very confusing, and that’s why the 
standard has the parenthetic note “7-Bit ASCII” and why people often use 
“US-ASCII” as the name of the ASCII encoding. The clarifying prefixes 
are, however, also misleading in the sense that they suggest the 
existence of other ASCIIs.



> I've always used the term "ASCII" in the 7-bit, 128 character sense,
> and modifying it with "plain" seems to reinforce that sense.
> (Although "plain text" in my understanding actually refers to lack of
> formatting.)


The attribute “plain” probably refers to plain text in the contexts 
given. Once people make the mistake of writing “ASCII” when they mean 
“text”, further confusion will be caused by attributes like “plain”, 
which are indeed ambiguous.



> Reason for asking is encountering a reference to "plain ASCII"
> describing text that clearly (by presence of accented characters)
> would be 8-bit.


It probably means “plain text”. But it could also mean “text in an 8-bit 
encoding”, if the author thinks of encodings like ISO 8859-1, 
windows-1252, ISO 8859-2, cp-850, Mac Roman, etc., as “extended ASCII” 
and even drops the attribute “extended”. It is conceivable that “plain 
ASCII” is even used to emphasize that the text is not in a Unicode encoding.
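
To illustrate why “extended ASCII” says little on its own, here is a minimal Python 3 sketch (not part of the original message) that decodes one and the same byte under several of the 8-bit encodings named above. The codec names are Python's standard ones; the function name and the choice of the byte 0xE9 are only assumptions made for the example.

```python
import unicodedata

# Python codec names for some 8-bit encodings that keep ASCII in 0x00-0x7F.
CODECS = ["latin-1", "cp1252", "iso8859_2", "cp850", "mac_roman"]

def interpretations(raw):
    """Map each codec name to the text that `raw` decodes to under it."""
    return {name: raw.decode(name) for name in CODECS}

if __name__ == "__main__":
    # One byte above 0x7F: the result depends on the assumed encoding.
    for name, text in interpretations(b"\xe9").items():
        print("%-10s U+%04X %s" % (name, ord(text), unicodedata.name(text)))
```

Bytes below 0x80 decode identically under all of these codecs; only the upper half differs, which is why the encoding has to be stated rather than guessed.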



> The context is one of many situations where in attaching a document
> to an email, it is advisable to include an unformatted text version
> of the document in the body of the email. Never mind that the latter
> is probably in UTF-8 anyway(?) - the issue here is the terminology.


The proper term for plain text is “plain text”. The word “unformatted” 
is often used, and might be seen as intuitively descriptive 
(unformatted, as opposite to text that contains formatting like bolding, 
colors, and different fonts), but it is risky. For one thing, plain text 
is often displayed “as is” with respect to line breaks and indentation, 
i.e. as “preformatted” (as in <pre> elements in HTML). Moreover, text 
that is not plain text need not be formatted. It could be e.g. an XML 
file where XML tags are used to mark up structural parts of the text, 
without causing or implying any specific formatting in rendering.


Yucca






Fwd: Terminology question re ASCII

2013-10-28 Thread Christopher Vance
Sorry, should have cc:d the list. Assume original mail was from a list
member.

-- Forwarded message --
From: Christopher Vance 
Date: 29 October 2013 16:58
Subject: Re: Terminology question re ASCII
To: Mark Davis ☕ 


Of course, once you have 8-bit characters in the upper range from 0x80 up,
you can only know intrinsically that it's not actually ASCII, and that
anybody who says it is, is probably lying.

You can only determine the actual character set used from extrinsic
information. Is the 8th bit just parity? Is it a Microsoft set with those
graphical things? Is it one of the Latin-N sets (which one)? EBCDIC?
Something else?
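
The intrinsic part of that check is easy to sketch. The following Python 3 snippet is an illustration only (the function names are made up for this example): it reports whether a byte string stays within the 7-bit range and, if not, where the first high byte sits. It deliberately makes no attempt to guess the actual character set, since that needs extrinsic information.

```python
def is_seven_bit(raw):
    """True if every byte is below 0x80, i.e. the data could be ASCII."""
    return all(b < 0x80 for b in raw)

def first_high_byte(raw):
    """Offset of the first byte >= 0x80, or None if there is none."""
    for offset, b in enumerate(raw):
        if b >= 0x80:
            return offset
    return None

if __name__ == "__main__":
    sample = b"caf\xe9"  # 0xE9 is outside the 7-bit range
    print(is_seven_bit(sample), first_high_byte(sample))
```

A True result only means the data could be ASCII; it does not prove that it is.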


On 29 October 2013 16:38, Mark Davis ☕  wrote:

> Normally the term ASCII just refers to the 7-bit form. What is sometimes
> called "8-bit ASCII" is the same as ISO Latin 1. If you want to be
> completely clear, you can say "7-bit ASCII".
>
>
> Mark
> — Il meglio è l’inimico del bene —
>
>
> On Tue, Oct 29, 2013 at 5:12 AM,  wrote:
>
>> Quick question on terminology use concerning a legacy encoding:
>>
>> If one refers to "plain ASCII," or "plain ASCII text" or "...
>> characters," should this be taken strictly as referring to the 7-bit basic
>> characters, or might it encompass characters that might appear in an 8-bit
>> character set (per the so-called "extended ASCII")?
>>
>> I've always used the term "ASCII" in the 7-bit, 128 character sense, and
>> modifying it with "plain" seems to reinforce that sense. (Although "plain
>> text" in my understanding actually refers to lack of formatting.)
>>
>> Reason for asking is encountering a reference to "plain ASCII" describing
>> text that clearly (by presence of accented characters) would be 8-bit.
>>
>> The context is one of many situations where in attaching a document to an
>> email, it is advisable to include an unformatted text version of the
>> document in the body of the email. Never mind that the latter is probably
>> in UTF-8 anyway(?) - the issue here is the terminology.
>>
>> TIA for any feedback.
>>
>> Don Osborn
>>
>> Sent via BlackBerry by AT&T
>>
>>
>>
>


-- 
Christopher Vance





Re: Terminology question re ASCII

2013-10-28 Thread Mark Davis ☕
Normally the term ASCII just refers to the 7-bit form. What is sometimes
called "8-bit ASCII" is the same as ISO Latin 1. If you want to be
completely clear, you can say "7-bit ASCII".
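
As a small check of that relationship, the Python 3 sketch below (an illustration with made-up function names, using Python's "ascii" and "latin-1" codecs) confirms that the lower 128 byte values decode the same way under both, and that Python's Latin-1 codec maps each byte value to the code point with the same number.

```python
def ascii_is_subset_of_latin1():
    """Check that bytes 0x00-0x7F decode identically as ASCII and Latin-1."""
    low = bytes(range(0x80))
    return low.decode("ascii") == low.decode("latin-1")

def latin1_matches_code_points():
    """Check that Latin-1 byte N decodes to code point N for all 256 values."""
    return all(bytes([n]).decode("latin-1") == chr(n) for n in range(0x100))

if __name__ == "__main__":
    print(ascii_is_subset_of_latin1(), latin1_matches_code_points())
```

Note that Python's codec also assigns the control ranges 0x00-0x1F and 0x80-0x9F, so this says nothing about other 8-bit sets that people call "8-bit ASCII".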


Mark
— Il meglio è l’inimico del bene —


On Tue, Oct 29, 2013 at 5:12 AM,  wrote:

> Quick question on terminology use concerning a legacy encoding:
>
> If one refers to "plain ASCII," or "plain ASCII text" or "... characters,"
> should this be taken strictly as referring to the 7-bit basic characters,
> or might it encompass characters that might appear in an 8-bit character
> set (per the so-called "extended ASCII")?
>
> I've always used the term "ASCII" in the 7-bit, 128 character sense, and
> modifying it with "plain" seems to reinforce that sense. (Although "plain
> text" in my understanding actually refers to lack of formatting.)
>
> Reason for asking is encountering a reference to "plain ASCII" describing
> text that clearly (by presence of accented characters) would be 8-bit.
>
> The context is one of many situations where in attaching a document to an
> email, it is advisable to include an unformatted text version of the
> document in the body of the email. Never mind that the latter is probably
> in UTF-8 anyway(?) - the issue here is the terminology.
>
> TIA for any feedback.
>
> Don Osborn
>
> Sent via BlackBerry by AT&T
>
>
>


Terminology question re ASCII

2013-10-28 Thread dzo
Quick question on terminology use concerning a legacy encoding:

If one refers to "plain ASCII," or "plain ASCII text" or "... characters," 
should this be taken strictly as referring to the 7-bit basic characters, or 
might it encompass characters that might appear in an 8-bit character set (per 
the so-called "extended ASCII")?

I've always used the term "ASCII" in the 7-bit, 128 character sense, and 
modifying it with "plain" seems to reinforce that sense. (Although "plain text" 
in my understanding actually refers to lack of formatting.)

Reason for asking is encountering a reference to "plain ASCII" describing text 
that clearly (by presence of accented characters) would be 8-bit. 

The context is one of many situations where in attaching a document to an 
email, it is advisable to include an unformatted text version of the document 
in the body of the email. Never mind that the latter is probably in UTF-8 
anyway(?) - the issue here is the terminology. 

TIA for any feedback. 

Don Osborn

Sent via BlackBerry by AT&T




Re: Re: Do you know a tool to decode "UTF-8 twice"

2013-10-28 Thread Buck Golemon
On Mon, Oct 28, 2013 at 9:48 AM, Buck Golemon  wrote:

>
>
>
> On Mon, Oct 28, 2013 at 6:06 AM, "Jörg Knappen"  wrote:
>
>> Hi Steffen,
>>
>> data aren't that easy. There are non-latin1-characters encoded in the
>> UTF8 part. I expect
>> among others typographic apostrophes, polish characters, some
>> mediaevalist characters like
>> ũ (u with tilde). Maybe, there is also some greek inside, but I am not
>> sure about that.
>>
>> --Jörg Knappen
>>
>> *Gesendet:* Montag, 28. Oktober 2013 um 12:34 Uhr
>> *Von:* "Steffen \"Daode\" Nurpmeso" 
>> *An:* "Jörg Knappen" 
>> *Cc:* unicode@unicode.org
>> *Betreff:* Re: Do you know a tool to decode "UTF-8 twice"
>> "Jörg Knappen"  wrote:
>> | Is there a ready made tool that decodes "UTF-8 twice" while keeping
>> | UTF-8 proper in place?
>>
>> Isn't a shell script with a truly validating iconv(1) enough?
>> This works for me if in utf8.1 there is 'ÄEIÖÜ' in UTF-8 and i run
>>
>> ?0[steffen@sherwood tmp]$ iconv -f latin1 -t utf8 < utf8.1 > utf8.2
>>
>> As in
>>
>> for i in utf8.1 utf8.2; do
>> if iconv -f utf8 -t latin1 < ${i} |
>> iconv -f utf8 -t utf8 >/dev/null 2>&1; then
>> echo ${i}: bummer, going home by one
>> iconv -f utf8 -t latin1 < ${i} > ${i}.new 2>&1
>> else
>> echo ${i}: valid UTF-8
>> fi
>> done
>>
>> i'll end up as
>>
>> ?0[steffen@sherwood tmp]$ sh utf8dec.sh
>> utf8.1: valid UTF-8
>> utf8.2: bummer, going home by one
>> ?0[steffen@sherwood tmp]$
>>
>> Ciao,
>>
>> | --Jörg Knappen
>>
>> --steffen
>>
>
> Jörg: There's no ready-made tool, but it's easy to write in python.
> I'll provide you a well-tested function in a few minutes.
>
>
>
>
Jörg:

Here is my function (also attached):
http://paste.pound-python.org/show/jAfKzb5HEyOeGvyF7O9W/
You can either make a larger python program with it, or expose it directly
to shell scripting.
These are the tests passed:

A latin1-encoded string should become utf8-encoded ... ok
An un-encoded unicode string should just become utf8-encoded ... ok
A utf8-encoded string should be unchanged ... ok
A poorly-encoded utf8+latin1 string should become utf8-encoded ... ok
A string mangled by utf8+latin1 several times should become utf8-encoded
... ok

--
Ran 5 tests in 0.001s

OK
# -*- coding: UTF-8 -*-
# Python 2 code: it relies on the `unicode` type and on str.decode().
def recode_utf8(data):
    """
    Given a string which is either:
     * unicode
     * well-encoded utf8
     * well-encoded latin1
     * poorly-encoded utf8+latin1
    Return the equivalent utf8-encoded byte string.
    """
    if isinstance(data, unicode):
        # The input is already decoded. Just return the utf8.
        return data.encode('UTF-8')

    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        # Indicates latin1 encoded bytes.
        decoded = data.decode('latin1')

    while True:
        # Check if the data is poorly-encoded as utf8+latin1
        try:
            encoded = decoded.encode('latin1')
        except UnicodeEncodeError:
            # Indicates non-latin1-encodable characters; it's not utf8+latin1.
            return decoded.encode('UTF-8')

        try:
            redecoded = encoded.decode('UTF-8')
        except UnicodeDecodeError:
            # Can't decode the latin1 as utf8; it's not utf8+latin1.
            return decoded.encode('UTF-8')

        if redecoded == decoded:
            # Pure ASCII round-trips unchanged; return instead of looping forever.
            return decoded.encode('UTF-8')
        decoded = redecoded


import unittest as T
class TestRecodeUtf8(T.TestCase):
    latin1 = u'München' # encodable to latin1
    utf8 = u'Łódź' # not encodable to latin1

    def test_unicode(self):
        "An un-encoded unicode string should just become utf8-encoded"
        self.assertEqual(
            recode_utf8(self.utf8),
            self.utf8.encode('UTF-8'),
        )

    def test_utf8(self):
        "A utf8-encoded string should be unchanged"
        utf8 = self.utf8.encode('UTF-8')
        self.assertEqual(
            recode_utf8(utf8),
            utf8,
        )

    def test_latin1(self):
        "A latin1-encoded string should become utf8-encoded"
        self.assertEqual(
            recode_utf8(self.latin1.encode('latin1')),
            self.latin1.encode('UTF-8'),
        )

    def test_utf8_plus_latin1(self):
        "A poorly-encoded utf8+latin1 string should become utf8-encoded"
        utf8 = self.utf8.encode('UTF-8')
        poorly_encoded = utf8.decode('latin1').encode('UTF-8')
        self.assertEqual(
            recode_utf8(poorly_encoded),
            utf8,
        )

    def test_utf8_plus_latin1_several_times(self):
        "A string mangle

Re: Re: Do you know a tool to decode "UTF-8 twice"

2013-10-28 Thread Rebecca Bettencourt
Jörg, by any chance would this do what you need?

http://www.kreativekorp.com/software/recode/#reinterpret

-- Rebecca Bettencourt


On Mon, Oct 28, 2013 at 9:48 AM, Buck Golemon  wrote:
>
>
>
> On Mon, Oct 28, 2013 at 6:06 AM, "Jörg Knappen"  wrote:
>>
>> Hi Steffen,
>>
>> data aren't that easy. There are non-latin1-characters encoded in the UTF8
>> part. I expect
>> among others typographic apostrophes, polish characters, some mediaevalist
>> characters like
>> ũ (u with tilde). Maybe, there is also some greek inside, but I am not
>> sure about that.
>>
>> --Jörg Knappen
>>
>> Gesendet: Montag, 28. Oktober 2013 um 12:34 Uhr
>> Von: "Steffen \"Daode\" Nurpmeso" 
>> An: "Jörg Knappen" 
>> Cc: unicode@unicode.org
>> Betreff: Re: Do you know a tool to decode "UTF-8 twice"
>> "Jörg Knappen"  wrote:
>> | Is there a ready made tool that decodes "UTF-8 twice" while keeping
>> | UTF-8 proper in place?
>>
>> Isn't a shell script with a truly validating iconv(1) enough?
>> This works for me if in utf8.1 there is 'ÄEIÖÜ' in UTF-8 and i run
>>
>> ?0[steffen@sherwood tmp]$ iconv -f latin1 -t utf8 < utf8.1 > utf8.2
>>
>> As in
>>
>> for i in utf8.1 utf8.2; do
>> if iconv -f utf8 -t latin1 < ${i} |
>> iconv -f utf8 -t utf8 >/dev/null 2>&1; then
>> echo ${i}: bummer, going home by one
>> iconv -f utf8 -t latin1 < ${i} > ${i}.new 2>&1
>> else
>> echo ${i}: valid UTF-8
>> fi
>> done
>>
>> i'll end up as
>>
>> ?0[steffen@sherwood tmp]$ sh utf8dec.sh
>> utf8.1: valid UTF-8
>> utf8.2: bummer, going home by one
>> ?0[steffen@sherwood tmp]$
>>
>> Ciao,
>>
>> | --Jörg Knappen
>>
>> --steffen
>
>
> Jörg: There's no ready-made tool, but it's easy to write in python.
> I'll provide you a well-tested function in a few minutes.
>
>
>




Re: Aw: Re: Do you know a tool to decode "UTF-8 twice"

2013-10-28 Thread Markus Scherer
Does "iconv -f utf8 -t latin1 < ${i} | iconv -f utf8 -t utf8" not work? It
decodes one layer of UTF-8 and tests if the result is still in UTF-8, that
seems right, and should work for all of Unicode.
markus
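
For comparison, the same two-step idea can be sketched in Python 3. This is only an illustration of the pipeline described above, not a replacement for it; the function name is an assumption, and the Latin-1 step uses Python's "latin-1" codec rather than iconv.

```python
def undo_one_layer(raw):
    """Return `raw` with one UTF-8-via-Latin-1 layer removed, or None.

    None means the bytes are not valid UTF-8, contain characters outside
    Latin-1, or do not yield valid UTF-8 once the layer is removed.
    """
    try:
        candidate = raw.decode("utf-8").encode("latin-1")
        candidate.decode("utf-8")  # validation only, result is discarded
    except (UnicodeDecodeError, UnicodeEncodeError):
        return None
    return candidate

if __name__ == "__main__":
    good = "\u0141\u00f3d\u017a".encode("utf-8")       # proper UTF-8
    twice = good.decode("latin-1").encode("utf-8")     # "UTF-8 twice"
    print(undo_one_layer(twice) == good, undo_one_layer(good))
```

One caveat of this sketch: pure ASCII input passes both steps and comes back unchanged, so a caller that loops until None is returned needs a separate stop condition for ASCII.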


Re: Re: Do you know a tool to decode "UTF-8 twice"

2013-10-28 Thread Buck Golemon
On Mon, Oct 28, 2013 at 6:06 AM, "Jörg Knappen"  wrote:

> Hi Steffen,
>
> data aren't that easy. There are non-latin1-characters encoded in the UTF8
> part. I expect
> among others typographic apostrophes, polish characters, some mediaevalist
> characters like
> ũ (u with tilde). Maybe, there is also some greek inside, but I am not
> sure about that.
>
> --Jörg Knappen
>
> *Gesendet:* Montag, 28. Oktober 2013 um 12:34 Uhr
> *Von:* "Steffen \"Daode\" Nurpmeso" 
> *An:* "Jörg Knappen" 
> *Cc:* unicode@unicode.org
> *Betreff:* Re: Do you know a tool to decode "UTF-8 twice"
> "Jörg Knappen"  wrote:
> | Is there a ready made tool that decodes "UTF-8 twice" while keeping
> | UTF-8 proper in place?
>
> Isn't a shell script with a truly validating iconv(1) enough?
> This works for me if in utf8.1 there is 'ÄEIÖÜ' in UTF-8 and i run
>
> ?0[steffen@sherwood tmp]$ iconv -f latin1 -t utf8 < utf8.1 > utf8.2
>
> As in
>
> for i in utf8.1 utf8.2; do
> if iconv -f utf8 -t latin1 < ${i} |
> iconv -f utf8 -t utf8 >/dev/null 2>&1; then
> echo ${i}: bummer, going home by one
> iconv -f utf8 -t latin1 < ${i} > ${i}.new 2>&1
> else
> echo ${i}: valid UTF-8
> fi
> done
>
> i'll end up as
>
> ?0[steffen@sherwood tmp]$ sh utf8dec.sh
> utf8.1: valid UTF-8
> utf8.2: bummer, going home by one
> ?0[steffen@sherwood tmp]$
>
> Ciao,
>
> | --Jörg Knappen
>
> --steffen
>

Jörg: There's no ready-made tool, but it's easy to write in python.
I'll provide you a well-tested function in a few minutes.


Re: Aw: Re: Do you know a tool to decode "UTF-8 twice"

2013-10-28 Thread Daode
"Jörg Knappen"  wrote:
 |   Hi Steffen,
 |
 |   data aren't that easy. There are non-latin1-characters encoded in the
 |   UTF8 part. I expect

I see..  Fantastic, now I feel responsible for hacking something
unless someone relieves me by tomorrow afternoon.
Sigh.

--steffen


Aw: Re: Do you know a tool to decode "UTF-8 twice"

2013-10-28 Thread Jörg Knappen

Hi Steffen,

 

data aren't that easy. There are non-latin1-characters encoded in the UTF8 part. I expect

among others typographic apostrophes, polish characters, some mediaevalist characters like

ũ (u with tilde). Maybe, there is also some greek inside, but I am not sure about that.
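
A minimal Python 3 sketch of how such mixed data might be sorted per value follows. It is an assumption-laden illustration, not a tested tool: the function name and labels are invented here, and it assumes the mangling went through Latin-1 exactly as described, not through windows-1252, which differs in the range 0x80-0x9F.

```python
def classify(raw):
    """Label one byte string as 'ascii', 'utf8', 'utf8-twice' or 'other'."""
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    try:
        # Undo one Latin-1 layer and see whether valid UTF-8 remains.
        text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "utf8"
    return "utf8-twice"

if __name__ == "__main__":
    proper = "\u0169".encode("utf-8")                  # u with tilde
    twice = proper.decode("latin-1").encode("utf-8")
    print(classify(proper), classify(twice))
```

Characters outside Latin-1 are not an obstacle here: in proper UTF-8 they make the Latin-1 step fail, which marks the value as already correct, while in doubly encoded data they appear only as Latin-1 characters and can be recovered.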

 

--Jörg Knappen

 

Gesendet: Montag, 28. Oktober 2013 um 12:34 Uhr
Von: "Steffen \"Daode\" Nurpmeso" 
An: "Jörg Knappen" 
Cc: unicode@unicode.org
Betreff: Re: Do you know a tool to decode "UTF-8 twice"

"Jörg Knappen"  wrote:
| Is there a ready made tool that decodes "UTF-8 twice" while keeping
| UTF-8 proper in place?

Isn't a shell script with a truly validating iconv(1) enough?
This works for me if in utf8.1 there is 'ÄEIÖÜ' in UTF-8 and i run

?0[steffen@sherwood tmp]$ iconv -f latin1 -t utf8 < utf8.1 > utf8.2

As in

for i in utf8.1 utf8.2; do
if iconv -f utf8 -t latin1 < ${i} |
iconv -f utf8 -t utf8 >/dev/null 2>&1; then
echo ${i}: bummer, going home by one
iconv -f utf8 -t latin1 < ${i} > ${i}.new 2>&1
else
echo ${i}: valid UTF-8
fi
done

i'll end up as

?0[steffen@sherwood tmp]$ sh utf8dec.sh
utf8.1: valid UTF-8
utf8.2: bummer, going home by one
?0[steffen@sherwood tmp]$

Ciao,

| --Jörg Knappen

--steffen






AW: ¥ instead of \

2013-10-28 Thread Dreiheller, Albrecht
On 2013/10/27 4:48, Martin J. Dürst  wrote:
> One thing that I have never checked personally, but which I heard from a 
> former colleague who knew a lot of character encoding trivia and 
> oddities, is that (at least at some point a few years ago) Japanese MS 
> Word would change U+00A5 to U+005C without asking the user. Possibly the 
> idea was that this way, the data could be more easily converted back 
> from Unicode to Shift_JIS. But in terms of moving away from using U+005C 
> with a Yen glyph, it was definitely counterproductive.

> Regards,   Martin.

As far as I can see,
- the standard Japanese keyboard layout produces \ (U+005C) in Latin mode,
  even if the keycap shows a ¥ Yen symbol.
- the IME, however, offers a choice between \ and ¥ interactively when in
  the appropriate mode.
- the font MS Mincho has a ¥ Yen glyph at the U+005C position,
  even in font version 5.01 shipped with Windows 7.

The impression of an automatic substitution done by Word might come from
the default MS Word settings
- "detect language automatically" and
- "use MS Mincho as default font" for Japanese text.
So it's not a substitution of the code points but of the glyphs only (which is
of course bad enough).
If the character is formatted with "Arial", a backslash should appear again.
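
One way to tell a glyph substitution from a code point substitution is to look at the stored characters directly, independent of any font. The Python 3 sketch below is only an illustration (the function name is made up); it uses the standard unicodedata module to list the code point and Unicode name of each character.

```python
import unicodedata

def describe(text):
    """List (code point, Unicode name) for each character in `text`."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in text]

if __name__ == "__main__":
    # A backslash stays U+005C whichever glyph a font draws for it.
    print(describe("\\"))
    print(describe("\u00a5"))
```

If text copied out of the document still reports U+005C, only the rendering changed.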

Best Regards, Albrecht.









Re: Do you know a tool to decode "UTF-8 twice"

2013-10-28 Thread Daode
"Jörg Knappen"  wrote:
 |   Is there a ready made tool that decodes "UTF-8 twice" while keeping
 |   UTF-8 proper in place?

Isn't a shell script with a truly validating iconv(1) enough?
This works for me if in utf8.1 there is 'ÄEIÖÜ' in UTF-8 and i run

  ?0[steffen@sherwood tmp]$ iconv -f latin1 -t utf8 < utf8.1 > utf8.2

As in

  for i in utf8.1 utf8.2; do
    if iconv -f utf8 -t latin1 < ${i} |
        iconv -f utf8 -t utf8 >/dev/null 2>&1; then
      echo ${i}: bummer, going home by one
      iconv -f utf8 -t latin1 < ${i} > ${i}.new 2>&1
    else
      echo ${i}: valid UTF-8
    fi
  done

i'll end up as

  ?0[steffen@sherwood tmp]$ sh utf8dec.sh   
   
  utf8.1: valid UTF-8
  utf8.2: bummer, going home by one
  ?0[steffen@sherwood tmp]$

Ciao,

 |   --Jörg Knappen

--steffen


Do you know a tool to decode "UTF-8 twice"

2013-10-28 Thread Jörg Knappen
I have a database with broken encoding, containing a lot of "UTF-8 twice"

(that infamous encoding that arises when UTF-8 is interpreted as latin-1 and

converted to UTF-8 again) encoding besides ASCII and UTF-8 proper.

 

Is there a ready made tool that decodes "UTF-8 twice" while keeping UTF-8 proper in place?
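
For readers unfamiliar with the problem, this short Python 3 sketch reproduces how "UTF-8 twice" comes about. It is an illustration only; the function name and its `times` parameter are assumptions made for the example.

```python
def mangle(text, times=1):
    """Encode `text` as UTF-8, then `times` times misread the bytes as
    Latin-1 and encode the result as UTF-8 again."""
    raw = text.encode("utf-8")
    for _ in range(times):
        raw = raw.decode("latin-1").encode("utf-8")
    return raw

if __name__ == "__main__":
    print(mangle("\u00fc", 0))  # proper UTF-8 for u with diaeresis
    print(mangle("\u00fc", 1))  # the same character as "UTF-8 twice"
```

ASCII characters survive any number of rounds unchanged, which is why such damage often goes unnoticed in mostly English data.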

 

--Jörg Knappen