Re: What extended ASCII character set uses 0x9D?

2017-08-22 Thread Chris Angelico
On Tue, Aug 22, 2017 at 5:15 PM, Gregory Ewing wrote: > Chris Angelico wrote: >> >> a naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it >> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So >> this one's still a mystery. > > > It's unlikely that even a naive

Re: What extended ASCII character set uses 0x9D?

2017-08-22 Thread Gregory Ewing
Chris Angelico wrote: a naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it would also convert 0x21 ("!") into 0x01 (SOH, a control character). So this one's still a mystery. It's unlikely that even a naive ascii upper/lower casing algorithm would be *that* naive; it would hav
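
The arithmetic behind that objection can be checked with a short Python sketch. The helper `naive_upper` is hypothetical and is not anything from the thread: it assumes an upper-casing routine that blindly clears bit 0x20 on every byte with no range check, which is the kind of routine that would be needed to turn 0xA1 into 0x81.

```python
def naive_upper(data: bytes) -> bytes:
    """Hypothetical 'too naive' upper-casing: clear bit 0x20 on every byte."""
    return bytes(b & 0xDF for b in data)

# 'a' (0x61) becomes 'A' (0x41), as intended.
print(naive_upper(b"a").hex())
# 0xA1 would become 0x81 ...
print(naive_upper(b"\xa1").hex())
# ... but '!' (0x21) would also become SOH (0x01), which is the objection.
print(naive_upper(b"!").hex())
```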

Re: What extended ASCII character set uses 0x9D?

2017-08-19 Thread Gregory Ewing
Ian Kelly wrote: One possibility is that it's the same two bytes. That would make it 0xE2 0x80 0x9D which is "right double quotation mark". Since it keeps appearing after ending double quotes that seems plausible, although one has to wonder why it appears *in addition to* the ASCII double quotes.
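
The three-byte sequence mentioned here can be verified directly in Python. This is only a quick check using the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

# The byte sequence proposed in the message above.
seq = bytes([0xE2, 0x80, 0x9D])

# Decode it as UTF-8 and look up the Unicode name of the result.
ch = seq.decode("utf-8")
name = unicodedata.name(ch)
print(hex(ord(ch)), name)
```

If only the final byte of that sequence survived some earlier mangling, a lone 0x9D is what would be left in the data.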

Re: What extended ASCII character set uses 0x9D?

2017-08-18 Thread John Nagle
On 08/17/2017 05:53 PM, Chris Angelico wrote: On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote: On 08/17/2017 05:14 PM, John Nagle wrote: I'm cleaning up some data which has text description fields from multiple sources. A few more cases: bytearray(b'\xe5\x81ukasz zmywaczyk') This

Re: What extended ASCII character set uses 0x9D?

2017-08-18 Thread Piet van Oostrum
Marko Rauhamaa writes: > Chris Angelico : > >> Ohh. We have no evidence that uppercasing is going on here, and a >> naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it >> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So >> this one's still a mystery. > > BT

Re: What extended ASCII character set uses 0x9D?

2017-08-18 Thread Random832
On Fri, Aug 18, 2017, at 03:39, Marko Rauhamaa wrote: > BTW, I was reading up on the history of ASCII control characters. Quite > fascinating. > > For example, have you ever wondered why DEL is the odd control character > out at the code point 127? The reason turns out to be paper punch tape. > By

Re: What extended ASCII character set uses 0x9D?

2017-08-18 Thread MRAB
On 2017-08-18 04:46, John Nagle wrote: On 08/17/2017 05:53 PM, Chris Angelico wrote: > On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote: >> On 08/17/2017 05:14 PM, John Nagle wrote: >>> I'm cleaning up some data which has text description fields from >>> multiple sources. >> A few

Re: What extended ASCII character set uses 0x9D?

2017-08-18 Thread Chris Angelico
On Fri, Aug 18, 2017 at 5:39 PM, Marko Rauhamaa wrote: > Chris Angelico : > >> Ohh. We have no evidence that uppercasing is going on here, and a >> naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it >> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So >> thi

Re: What extended ASCII character set uses 0x9D?

2017-08-18 Thread Marko Rauhamaa
Chris Angelico : > Ohh. We have no evidence that uppercasing is going on here, and a > naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it > would also convert 0x21 ("!") into 0x01 (SOH, a control character). So > this one's still a mystery. BTW, I was reading up on the history

Re: What extended ASCII character set uses 0x9D?

2017-08-18 Thread Chris Angelico
On Fri, Aug 18, 2017 at 5:11 PM, Marko Rauhamaa wrote: > Chris Angelico : > >> On Fri, Aug 18, 2017 at 4:57 PM, Marko Rauhamaa wrote: >>> Chris Angelico : >>> On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin wrote: > John Nagle writes: >> Since, as someone pointed out, there was U

Re: What extended ASCII character set uses 0x9D?

2017-08-18 Thread Marko Rauhamaa
Chris Angelico : > On Fri, Aug 18, 2017 at 4:57 PM, Marko Rauhamaa wrote: >> Chris Angelico : >> >>> On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin wrote: John Nagle writes: > Since, as someone pointed out, there was UTF-8 which had been > run through an ASCII-type lower casing algori

Re: What extended ASCII character set uses 0x9D?

2017-08-18 Thread Chris Angelico
On Fri, Aug 18, 2017 at 4:57 PM, Marko Rauhamaa wrote: > Chris Angelico : > >> On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin wrote: >>> John Nagle writes: Since, as someone pointed out, there was UTF-8 which had been run through an ASCII-type lower casing algorithm >>> >>> I spent a few

Re: What extended ASCII character set uses 0x9D?

2017-08-18 Thread Marko Rauhamaa
Chris Angelico : > On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin wrote: >> John Nagle writes: >>> Since, as someone pointed out, there was UTF-8 which had been >>> run through an ASCII-type lower casing algorithm >> >> I spent a few minutes figuring out if some of the mysterious 0x81's >> could be

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread Chris Angelico
On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin wrote: > John Nagle writes: >> Since, as someone pointed out, there was UTF-8 which had been >> run through an ASCII-type lower casing algorithm > > I spent a few minutes figuring out if some of the mysterious 0x81's > could be from ASCII-lower-casing s

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread Paul Rubin
John Nagle writes: > Since, as someone pointed out, there was UTF-8 which had been > run through an ASCII-type lower casing algorithm I spent a few minutes figuring out if some of the mysterious 0x81's could be from ASCII-lower-casing some Unicode combining characters, but the numbers didn't seem

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread Chris Angelico
On Fri, Aug 18, 2017 at 4:24 PM, John Nagle wrote: >I'm coming around to the idea that some of these snippets > have been previously mis-converted, which is why they make no sense. > Since, as someone pointed out, there was UTF-8 which had been > run through an ASCII-type lower casing algorith

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread John Nagle
On 08/17/2017 10:12 PM, Ian Kelly wrote: Here's some more 0x9d usage, each from a different data item: Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\" This one seems like a good hint since \x99 here looks
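
One way to test a hint like this is to compare the stray bytes with the last byte of the UTF-8 encodings of the curly quote characters, and with what Python's `cp1252` codec does with the same bytes. The helper names below are hypothetical, and this is a sketch of a diagnostic, not a statement about how the data was actually damaged.

```python
def last_utf8_byte(ch: str) -> int:
    """Return the final byte of a character's UTF-8 encoding."""
    return ch.encode("utf-8")[-1]

def cp1252_defines(byte: int) -> bool:
    """Report whether Python's cp1252 codec maps this byte to a character."""
    try:
        bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        return False
    return True

# U+2019 (right single quote) and U+201D (right double quote).
print(hex(last_utf8_byte("\u2019")), hex(last_utf8_byte("\u201d")))

# 0x99 is a defined cp1252 byte, while 0x9D is rejected by the codec.
print(cp1252_defines(0x99), cp1252_defines(0x9D))
```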

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread Steve D'Aprano
On Fri, 18 Aug 2017 10:14 am, John Nagle wrote: > I'm cleaning up some data which has text description fields from > multiple sources. Some are in UTF-8. Some are in WINDOWS-1252. > And some are in some other character set. So I have to examine and > sanity check each field in a database

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread Ian Kelly
On Thu, Aug 17, 2017 at 9:46 PM, John Nagle wrote: >The 0x9d thing seems unrelated to the Polish names thing. 0x9d > shows up in the middle of English text that's otherwise ASCII. > Is this something that can appear as a result of cutting and > pasting from Microsoft Word? > >I'd like to

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread John Nagle
On 08/17/2017 05:53 PM, Chris Angelico wrote: > On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote: >> On 08/17/2017 05:14 PM, John Nagle wrote: >>> I'm cleaning up some data which has text description fields from >>> multiple sources. >> A few more cases: >> >> bytearray(b'\xe5\x81ukasz zm

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread Ian Kelly
On Thu, Aug 17, 2017 at 8:15 PM, MRAB wrote: > On 2017-08-18 01:53, Chris Angelico wrote: >> So here's an insane theory: something attempted to lower-case the byte >> stream as if it were ASCII. If you ignore the high bit, 0xC5 looks >> like 0x45 or "E", which lower-cases by having 32 added to it,
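
The lower-casing theory quoted above can be reproduced in a few lines of Python. The function `naive_lower` is a hypothetical reconstruction of such a routine, assuming it tests each byte with the high bit ignored and then adds 32 to anything that looks like an upper-case ASCII letter; nothing in the thread confirms this is what actually processed the data.

```python
def naive_lower(data: bytes) -> bytes:
    """Hypothetical ASCII lower-casing that ignores the high bit when testing."""
    out = bytearray()
    for b in data:
        # 0xC5 & 0x7F == 0x45 ('E'), so the byte is treated as a capital letter.
        if ord("A") <= (b & 0x7F) <= ord("Z"):
            b += 32
        out.append(b)
    return bytes(out)

# Correct UTF-8 for the two names, then the mangled form the theory predicts.
print(naive_lower("\u0141ukasz".encode("utf-8")))
print(naive_lower("\u00c1ngel".encode("utf-8")))
```

Under that assumption the output matches the `\xe5\x81ukasz` and `\xe3\x81ngel` samples from the original data.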

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread MRAB
On 2017-08-18 01:30, John Nagle wrote: On 08/17/2017 05:14 PM, John Nagle wrote: > I'm cleaning up some data which has text description fields from > multiple sources. A few more cases: bytearray(b'miguel \xe3\x81ngel santos') bytearray(b'lidija kmeti\xe4\x8d') bytearray(b'\xe5\x81ukasz

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread MRAB
On 2017-08-18 01:53, Chris Angelico wrote: On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote: On 08/17/2017 05:14 PM, John Nagle wrote: I'm cleaning up some data which has text description fields from multiple sources. A few more cases: bytearray(b'\xe5\x81ukasz zmywaczyk') This one

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread MRAB
On 2017-08-18 01:14, John Nagle wrote: I'm cleaning up some data which has text description fields from multiple sources. Some are in UTF-8. Some are in WINDOWS-1252. And some are in some other character set. So I have to examine and sanity check each field in a database dump, deciding

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread Ben Bacarisse
John Nagle writes: > I'm cleaning up some data which has text description fields from > multiple sources. Some are in UTF-8. Some are in WINDOWS-1252. > And some are in some other character set. So I have to examine and > sanity check each field in a database dump, deciding which characte

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread Ian Kelly
On Thu, Aug 17, 2017 at 6:53 PM, Chris Angelico wrote: > That doesn't work for everything, though. The 0x81 0x81 and 0x9d ones > are still a puzzle. I'm fairly sure that b'M\x81\x81\xfcnster' is 'Münster'. It decodes to that in Latin-1 if you remove the \x81 bytes. The question then is what those
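
The Latin-1 reading of that sample is easy to confirm. The snippet below is only a check of the claim, assuming the two `\x81` bytes are simply dropped; it does not explain where they came from.

```python
raw = b"M\x81\x81\xfcnster"

# Drop the unexplained 0x81 bytes, then decode what is left as Latin-1.
cleaned = raw.replace(b"\x81", b"")
text = cleaned.decode("latin-1")
print(text)
```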

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread Chris Angelico
On Fri, Aug 18, 2017 at 10:54 AM, Ian Kelly wrote: > On Thu, Aug 17, 2017 at 6:52 PM, Ian Kelly wrote: >> On Thu, Aug 17, 2017 at 6:30 PM, John Nagle wrote: >>> A few more cases: >>> >>> bytearray(b'miguel \xe3\x81ngel santos') >> >> If that were b'\xc3\x81' it would be Á in UTF-8 which would fi

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread Ian Kelly
On Thu, Aug 17, 2017 at 6:52 PM, Ian Kelly wrote: > On Thu, Aug 17, 2017 at 6:30 PM, John Nagle wrote: >> A few more cases: >> >> bytearray(b'miguel \xe3\x81ngel santos') > > If that were b'\xc3\x81' it would be Á in UTF-8 which would fit the > rest of the name. > >> bytearray(b'\xe5\x81ukasz zmy

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread Ian Kelly
On Thu, Aug 17, 2017 at 6:30 PM, John Nagle wrote: > A few more cases: > > bytearray(b'miguel \xe3\x81ngel santos') If that were b'\xc3\x81' it would be Á in UTF-8 which would fit the rest of the name. > bytearray(b'\xe5\x81ukasz zmywaczyk') If that were b'\xc5\x81' it would be Ł in UTF-8 which
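
Substituting the suggested lead bytes into the full samples shows that both then decode cleanly as UTF-8. The `repaired_*` names are illustrative only.

```python
# Samples with the first byte of each pair changed as suggested above
# (0xE3 -> 0xC3 and 0xE5 -> 0xC5).
repaired_miguel = b"miguel \xc3\x81ngel santos"
repaired_lukasz = b"\xc5\x81ukasz zmywaczyk"

print(repaired_miguel.decode("utf-8"))
print(repaired_lukasz.decode("utf-8"))
```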

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread Chris Angelico
On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote: > On 08/17/2017 05:14 PM, John Nagle wrote: >> I'm cleaning up some data which has text description fields from >> multiple sources. > A few more cases: > > bytearray(b'\xe5\x81ukasz zmywaczyk') This one has to be Polish, and the first char

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread Ian Kelly
On Thu, Aug 17, 2017 at 6:27 PM, Chris Angelico wrote: > On Fri, Aug 18, 2017 at 10:14 AM, John Nagle wrote: >> I'm cleaning up some data which has text description fields from >> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252. >> And some are in some other character set. S

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread John Nagle
On 08/17/2017 05:14 PM, John Nagle wrote: > I'm cleaning up some data which has text description fields from > multiple sources. A few more cases: bytearray(b'miguel \xe3\x81ngel santos') bytearray(b'lidija kmeti\xe4\x8d') bytearray(b'\xe5\x81ukasz zmywaczyk') bytearray(b'M\x81\x81\xfcnster'

Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread Chris Angelico
On Fri, Aug 18, 2017 at 10:14 AM, John Nagle wrote: > I'm cleaning up some data which has text description fields from > multiple sources. Some are in UTF-8. Some are in WINDOWS-1252. > And some are in some other character set. So I have to examine and > sanity check each field in a databa

What extended ASCII character set uses 0x9D?

2017-08-17 Thread John Nagle
I'm cleaning up some data which has text description fields from multiple sources. Some are in UTF-8. Some are in WINDOWS-1252. And some are in some other character set. So I have to examine and sanity check each field in a database dump, deciding which character set best represents what's
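
A per-field check of this kind might look roughly like the sketch below. The function `guess_decode` and its fallback order are assumptions for illustration, not the poster's actual code: it tries strict UTF-8 first, then Python's `cp1252` codec, and reports failure so that fields like the 0x9D cases can be set aside for manual inspection.

```python
from typing import Optional, Tuple

def guess_decode(raw: bytes) -> Tuple[Optional[str], Optional[str]]:
    """Try strict UTF-8, then cp1252; return (text, encoding) or (None, None)."""
    for enc in ("utf-8", "cp1252"):
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Neither codec accepts the bytes; flag the field for a closer look.
    return None, None

for field in (b"caf\xc3\xa9", b"caf\xe9", b"Platinum Life Country,\x9d"):
    print(field, guess_decode(field))
```

Trying UTF-8 first is a deliberate choice in this sketch, since valid multi-byte UTF-8 is rarely produced by accident, whereas almost any byte string decodes under a single-byte code page.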