On Tue, Aug 22, 2017 at 5:15 PM, Gregory Ewing
wrote:
> Chris Angelico wrote:
>>
>> a naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
>> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
>> this one's still a mystery.
>
>
> It's unlikely that even a naive
Chris Angelico wrote:
a naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
this one's still a mystery.
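
The arithmetic behind that objection can be checked directly. This is only a sketch: it assumes the hypothetical "naive upper-casing" clears bit 0x20 on every byte without checking that the byte is a letter, and the function name is made up for illustration.

```python
def naive_upper_clear_bit(data: bytes) -> bytes:
    """Hypothetical upper-casing that clears bit 0x20 on every byte.

    This is an assumed worst-case routine, not anything identified
    in the data under discussion.
    """
    return bytes(b & 0xDF for b in data)


# 0xA1 would become 0x81, but "!" (0x21) would become SOH (0x01) too.
print(naive_upper_clear_bit(b"\xa1").hex())
print(naive_upper_clear_bit(b"!").hex())
```

A routine this blunt would litter otherwise plain ASCII text with control characters, which is the reason given for doubting it.
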
It's unlikely that even a naive ascii upper/lower casing algorithm
would be *that* naive; it would hav
Ian Kelly wrote:
One possibility is that it's the same two bytes. That would make it
0xE2 0x80 0x9D which is "right double quotation mark". Since it keeps
appearing after ending double quotes that seems plausible, although
one has to wonder why it appears *in addition to* the ASCII double
quotes.
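
The byte values mentioned here are easy to confirm with the standard codecs. The following is a small sketch, not a diagnosis of the data; the variable names are illustrative only.

```python
# UTF-8 encodings of the two "smart" closing quotes.
right_double = "\u201d".encode("utf-8")   # RIGHT DOUBLE QUOTATION MARK
right_single = "\u2019".encode("utf-8")   # RIGHT SINGLE QUOTATION MARK
print(right_double)
print(right_single)

# In Python's cp1252 codec, 0x9d is unassigned, so a lone 0x9d
# cannot be valid WINDOWS-1252 text.
try:
    b"\x9d".decode("cp1252")
    nine_d_is_defined = True
except UnicodeDecodeError:
    nine_d_is_defined = False
print(nine_d_is_defined)
```

Both sequences start with 0xE2 0x80 and differ only in the last byte (0x9D versus 0x99), which is consistent with stray 0x9d and 0x99 bytes being the tails of damaged UTF-8 quotes.
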
On 08/17/2017 05:53 PM, Chris Angelico wrote:
On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote:
On 08/17/2017 05:14 PM, John Nagle wrote:
I'm cleaning up some data which has text description fields from
multiple sources.
A few more cases:
bytearray(b'\xe5\x81ukasz zmywaczyk')
This
Marko Rauhamaa writes:
> Chris Angelico :
>
>> Ohh. We have no evidence that uppercasing is going on here, and a
>> naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
>> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
>> this one's still a mystery.
>
> BT
On Fri, Aug 18, 2017, at 03:39, Marko Rauhamaa wrote:
> BTW, I was reading up on the history of ASCII control characters. Quite
> fascinating.
>
> For example, have you ever wondered why DEL is the odd control character
> out at the code point 127? The reason turns out to be paper punch tape.
> By
On 2017-08-18 04:46, John Nagle wrote:
On 08/17/2017 05:53 PM, Chris Angelico wrote:
> On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote:
>> On 08/17/2017 05:14 PM, John Nagle wrote:
>>> I'm cleaning up some data which has text description fields from
>>> multiple sources.
>> A few
On Fri, Aug 18, 2017 at 5:39 PM, Marko Rauhamaa wrote:
> Chris Angelico :
>
>> Ohh. We have no evidence that uppercasing is going on here, and a
>> naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
>> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
>> thi
Chris Angelico :
> Ohh. We have no evidence that uppercasing is going on here, and a
> naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
> this one's still a mystery.
BTW, I was reading up on the history
On Fri, Aug 18, 2017 at 5:11 PM, Marko Rauhamaa wrote:
> Chris Angelico :
>
>> On Fri, Aug 18, 2017 at 4:57 PM, Marko Rauhamaa wrote:
>>> Chris Angelico :
>>>
On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin
wrote:
> John Nagle writes:
>> Since, as someone pointed out, there was U
Chris Angelico :
> On Fri, Aug 18, 2017 at 4:57 PM, Marko Rauhamaa wrote:
>> Chris Angelico :
>>
>>> On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin wrote:
John Nagle writes:
> Since, as someone pointed out, there was UTF-8 which had been
> run through an ASCII-type lower casing algori
On Fri, Aug 18, 2017 at 4:57 PM, Marko Rauhamaa wrote:
> Chris Angelico :
>
>> On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin wrote:
>>> John Nagle writes:
>>>> Since, as someone pointed out, there was UTF-8 which had been
>>>> run through an ASCII-type lower casing algorithm
>>>
>>> I spent a few
Chris Angelico :
> On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin wrote:
>> John Nagle writes:
>>> Since, as someone pointed out, there was UTF-8 which had been
>>> run through an ASCII-type lower casing algorithm
>>
>> I spent a few minutes figuring out if some of the mysterious 0x81's
>> could be
On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin wrote:
> John Nagle writes:
>> Since, as someone pointed out, there was UTF-8 which had been
>> run through an ASCII-type lower casing algorithm
>
> I spent a few minutes figuring out if some of the mysterious 0x81's
> could be from ASCII-lower-casing s
John Nagle writes:
> Since, as someone pointed out, there was UTF-8 which had been
> run through an ASCII-type lower casing algorithm
I spent a few minutes figuring out if some of the mysterious 0x81's
could be from ASCII-lower-casing some Unicode combining characters, but
the numbers didn't seem
On Fri, Aug 18, 2017 at 4:24 PM, John Nagle wrote:
> I'm coming around to the idea that some of these snippets
> have been previously mis-converted, which is why they make no sense.
> Since, as someone pointed out, there was UTF-8 which had been
> run through an ASCII-type lower casing algorith
On 08/17/2017 10:12 PM, Ian Kelly wrote:
Here's some more 0x9d usage, each from a different data item:
Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The
Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\"
This one seems like a good hint since \x99 here looks
On Fri, 18 Aug 2017 10:14 am, John Nagle wrote:
> I'm cleaning up some data which has text description fields from
> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
> And some are in some other character set. So I have to examine and
> sanity check each field in a database
On Thu, Aug 17, 2017 at 9:46 PM, John Nagle wrote:
> The 0x9d thing seems unrelated to the Polish names thing. 0x9d
> shows up in the middle of English text that's otherwise ASCII.
> Is this something that can appear as a result of cutting and
> pasting from Microsoft Word?
>
> I'd like to
On 08/17/2017 05:53 PM, Chris Angelico wrote:
> On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote:
>> On 08/17/2017 05:14 PM, John Nagle wrote:
>>> I'm cleaning up some data which has text description fields from
>>> multiple sources.
>> A few more cases:
>>
>> bytearray(b'\xe5\x81ukasz zm
On Thu, Aug 17, 2017 at 8:15 PM, MRAB wrote:
> On 2017-08-18 01:53, Chris Angelico wrote:
>> So here's an insane theory: something attempted to lower-case the byte
>> stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
>> like 0x45 or "E", which lower-cases by having 32 added to it,
On 2017-08-18 01:30, John Nagle wrote:
On 08/17/2017 05:14 PM, John Nagle wrote:
> I'm cleaning up some data which has text description fields from
> multiple sources.
A few more cases:
bytearray(b'miguel \xe3\x81ngel santos')
bytearray(b'lidija kmeti\xe4\x8d')
bytearray(b'\xe5\x81ukasz
On 2017-08-18 01:53, Chris Angelico wrote:
On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote:
On 08/17/2017 05:14 PM, John Nagle wrote:
I'm cleaning up some data which has text description fields from
multiple sources.
A few more cases:
bytearray(b'\xe5\x81ukasz zmywaczyk')
This one
On 2017-08-18 01:14, John Nagle wrote:
I'm cleaning up some data which has text description fields from
multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
And some are in some other character set. So I have to examine and
sanity check each field in a database dump, deciding
John Nagle writes:
> I'm cleaning up some data which has text description fields from
> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
> And some are in some other character set. So I have to examine and
> sanity check each field in a database dump, deciding which characte
On Thu, Aug 17, 2017 at 6:53 PM, Chris Angelico wrote:
> That doesn't work for everything, though. The 0x81 0x81 and 0x9d ones
> are still a puzzle.
I'm fairly sure that b'M\x81\x81\xfcnster' is 'Münster'. It decodes to
that in Latin-1 if you remove the \x81 bytes. The question then is
what those
On Fri, Aug 18, 2017 at 10:54 AM, Ian Kelly wrote:
> On Thu, Aug 17, 2017 at 6:52 PM, Ian Kelly wrote:
>> On Thu, Aug 17, 2017 at 6:30 PM, John Nagle wrote:
>>> A few more cases:
>>>
>>> bytearray(b'miguel \xe3\x81ngel santos')
>>
>> If that were b'\xc3\x81' it would be Á in UTF-8 which would fi
On Thu, Aug 17, 2017 at 6:52 PM, Ian Kelly wrote:
> On Thu, Aug 17, 2017 at 6:30 PM, John Nagle wrote:
>> A few more cases:
>>
>> bytearray(b'miguel \xe3\x81ngel santos')
>
> If that were b'\xc3\x81' it would be Á in UTF-8 which would fit the
> rest of the name.
>
>> bytearray(b'\xe5\x81ukasz zmy
On Thu, Aug 17, 2017 at 6:30 PM, John Nagle wrote:
> A few more cases:
>
> bytearray(b'miguel \xe3\x81ngel santos')
If that were b'\xc3\x81' it would be Á in UTF-8 which would fit the
rest of the name.
> bytearray(b'\xe5\x81ukasz zmywaczyk')
If that were b'\xc5\x81' it would be Ł in UTF-8 which
On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote:
> On 08/17/2017 05:14 PM, John Nagle wrote:
>> I'm cleaning up some data which has text description fields from
>> multiple sources.
> A few more cases:
>
> bytearray(b'\xe5\x81ukasz zmywaczyk')
This one has to be Polish, and the first char
On Thu, Aug 17, 2017 at 6:27 PM, Chris Angelico wrote:
> On Fri, Aug 18, 2017 at 10:14 AM, John Nagle wrote:
>> I'm cleaning up some data which has text description fields from
>> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
>> And some are in some other character set. S
On 08/17/2017 05:14 PM, John Nagle wrote:
> I'm cleaning up some data which has text description fields from
> multiple sources.
A few more cases:
bytearray(b'miguel \xe3\x81ngel santos')
bytearray(b'lidija kmeti\xe4\x8d')
bytearray(b'\xe5\x81ukasz zmywaczyk')
bytearray(b'M\x81\x81\xfcnster'
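
One theory raised in this thread is that UTF-8 text was lower-cased by an ASCII-style routine that ignores the high bit, so that 0xC5 is treated like 0x45 ("E") and has 32 added to it. A minimal sketch of such a routine follows. The function name and the capitalised "original" names are assumptions made for illustration, not recovered data.

```python
def ascii_lower_ignoring_high_bit(data: bytes) -> bytes:
    """Lower-case bytes as if they were 7-bit ASCII, ignoring bit 7.

    Any byte whose low seven bits fall in 'A'..'Z' gets 0x20 set,
    including UTF-8 lead bytes such as 0xC3, 0xC4 and 0xC5.
    """
    out = bytearray()
    for b in data:
        if 0x41 <= (b & 0x7F) <= 0x5A:
            b |= 0x20
        out.append(b)
    return bytes(out)


# Assumed original spellings, written with escapes for the non-ASCII letters.
names = (
    "Miguel \u00c1ngel Santos",   # A with acute, UTF-8 C3 81
    "Lidija Kmeti\u010d",         # c with caron, UTF-8 C4 8D
    "\u0141ukasz Zmywaczyk",      # L with stroke, UTF-8 C5 81
)
for name in names:
    print(ascii_lower_ignoring_high_bit(name.encode("utf-8")))
```

Under these assumptions the output matches the first three samples byte for byte. The routine leaves continuation bytes such as 0x81 alone, so it does not by itself account for the doubled 0x81 in the fourth sample.
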
On Fri, Aug 18, 2017 at 10:14 AM, John Nagle wrote:
> I'm cleaning up some data which has text description fields from
> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
> And some are in some other character set. So I have to examine and
> sanity check each field in a databa
I'm cleaning up some data which has text description fields from
multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
And some are in some other character set. So I have to examine and
sanity check each field in a database dump, deciding which character
set best represents what's