Re: New Charakter Proposal

2002-11-01 Thread Markus Scherer
David Starner wrote: Chances are nearly 100% that overlong UTF-8 was a spoofing attempt, or the result of something other than a UTF-8 encoder. With the exception of overlong sequences for null (C0 80?), which Java generates in an attempt to avoid true nulls. I am aware of this one. This encod

Re: New Charakter Proposal

2002-10-31 Thread Tex Texin
William, Note the smiley. Ken's suggestion was a tongue in the hollow-skulls cheek. Yes, a 2 character sequence is less likely to occur, but is still a possibility, so your proposal doesn't actually fix the problem. The usual workaround is for a convention that uses characters with special semant

Re: New Charakter Proposal

2002-10-31 Thread William Overington
Kenneth Whistler wrote the following. >I think Markus's suggestion is correct. If you want to do >something like this internally to a process, use a noncharacter >code point for it. If you want to have visible display of this >kind of error handling for conversion, then simply declare a >convention

RE: New Charakter Proposal

2002-10-31 Thread Dominikus Scherkl
Hello. Markus Scherer wrote: > Chances are nearly 100% that overlong UTF-8 was a > spoofing attempt, or the result of something other than a > UTF-8 encoder. Correct. This is exactly my topic. Wouldn't it be nice to have a standardized way to indicate that an attack on the message has occurred wi

RE: New Charakter Proposal

2002-10-30 Thread Kenneth Whistler
Dominikus Scherkl replied to Markus: > > > My other suggestion (and the main reason to call the proposed > > > character "source failure indicator symbol" (SFIS)) was intended > > > especially for malformed UTF-8 input that has overlong encodings. > > This is a special, custom form of error handli

Re: New Charakter Proposal

2002-10-30 Thread David Starner
On Wed, Oct 30, 2002 at 03:13:53PM -0800, Markus Scherer wrote: > Chances are nearly 100% that overlong UTF-8 was a spoofing attempt, or the > result of something other than a UTF-8 encoder. With the exception of overlong sequences for null (C0 80?), which Java generates in an attempt to avoid tr
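
For readers unfamiliar with the overlong null mentioned here, the following is a minimal Python sketch, not taken from the thread. It shows why C0 80 counts as overlong (a two-byte form whose payload is below U+0080) and that a strict UTF-8 decoder rejects it. The helper names `is_overlong_two_byte` and `strict_decode_ok` are hypothetical, chosen only for illustration.

```python
def is_overlong_two_byte(seq: bytes) -> bool:
    """Return True if seq has the 2-byte UTF-8 shape but encodes a value below U+0080."""
    if len(seq) != 2:
        return False
    lead, trail = seq
    # A 2-byte sequence has the bit pattern 110xxxxx 10xxxxxx.
    if lead & 0xE0 != 0xC0 or trail & 0xC0 != 0x80:
        return False
    value = ((lead & 0x1F) << 6) | (trail & 0x3F)
    # Values below 0x80 fit in one byte, so the 2-byte form is overlong.
    return value < 0x80


def strict_decode_ok(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Under these definitions, `is_overlong_two_byte(b"\xc0\x80")` is true, while `strict_decode_ok(b"\xc0\x80")` is false: a conforming decoder treats the sequence as an error rather than as U+0000.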

Re: New Charakter Proposal

2002-10-30 Thread Markus Scherer
Dominikus Scherkl wrote: Converting from and to UTF-8 is an everyday topic, very important for all applications dealing with Unicode. So it is a special Converting text to/from UTF-8 is indeed common and important. Converting text that claims to be UTF-8 - but isn't - is different: It may be a

RE: New Charakter Proposal

2002-10-30 Thread Dominikus Scherkl
Markus Scherer wrote: > Dominikus Scherkl wrote: > > My other suggestion (and the main reason to call the proposed > > character "source failure indicator symbol" (SFIS)) was intended > > especially for malformed UTF-8 input that has overlong encodings. > This is a special, custom form of error han

Re: New Charakter Proposal

2002-10-30 Thread Markus Scherer
Dominikus Scherkl wrote: My other suggestion (and the main reason to call the proposed character "source failure indicator symbol" (SFIS)) was intended especially for malformed UTF-8 input that has overlong encodings. In this special case a converter knows exactly which char is intended, but need

RE: New Charakter Proposal

2002-10-30 Thread Dominikus Scherkl
John Cowan wrote: > This sounds basically like an extension of U+303E IDEOGRAPHIC > VARIATION INDICATOR (whose semantic is: "The following character > is not what I want, but it's the best approximation I can get") > to non-ideographs. > > I have no problem with this idea. So you mean: use U+303

RE: New Charakter Proposal

2002-10-30 Thread Marco Cimarosti
Dominikus Scherkl wrote: > I would like to have a "source failure indicator symbol" (SFIS) > character in Unicode, which a charset-conversion unit may > insert into a text (suggested position: U+FFF8). > > [...] > > Of course a converter can still use U+FFFD if it has no > idea which charact

Re: New Charakter Proposal

2002-10-30 Thread Mark Davis
We had thought of something similar, but which would provide more information in interfaces. Reserve a space of 256 code points, with names: UNCONVERTIBLE BYTE-00 UNCONVERTIBLE BYTE-01 ... UNCONVERTIBLE BYTE-FF During a conversion process, if some bytes (say from corrupt UTF-8) cannot be correct
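
The snippet above is cut off before it says what happens to the unconvertible bytes, so the following Python sketch rests on an assumption: each byte that cannot be decoded is mapped to one of 256 reserved code points, so the original byte value stays visible to the caller. No such UNCONVERTIBLE BYTE block exists in Unicode; the base `0xF700` is an arbitrary private-use value, and the handler name `unconvertible-byte` is hypothetical. Python's built-in `surrogateescape` error handler follows a comparable idea using lone surrogates.

```python
import codecs

# Hypothetical base for the 256 "UNCONVERTIBLE BYTE-xx" code points.
# This is an arbitrary private-use value, not an assigned Unicode block.
BASE = 0xF700


def _unconvertible(exc):
    """Error handler: map each undecodable byte b to the code point BASE + b."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which decoding resumes.
    return "".join(chr(BASE + b) for b in bad), exc.end


codecs.register_error("unconvertible-byte", _unconvertible)


def decode_keeping_bad_bytes(data: bytes) -> str:
    """Decode UTF-8, keeping undecodable bytes as placeholder code points."""
    return data.decode("utf-8", errors="unconvertible-byte")
```

With this sketch, decoding `b"a\xc0\x80b"` keeps the two bad bytes as U+F7C0 and U+F780 instead of collapsing them into U+FFFD, which is the extra information the proposal appears to be after.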