In message <[EMAIL PROTECTED]> [email protected] writes:
> On Thu, Mar 27, 2008 at 04:39:17PM +0000, Duncan Coutts wrote:
> > Can't we just reject them with the error message and ask people to fix the
> > latin-1 sequences and re-upload using proper UTF-8?
>
> The problem is that there are packages there now with .cabal files
> assuming Latin-1. Stopping more of them from getting in is fine, but
> we need to display the ones that are there correctly.
Parsing them is essential, displaying them correctly is a bonus.

> Hmm, after considering a few schemes it's probably simplest to introduce
> strict enforcement on upload and retroactively patch the existing Latin-1
> packages to UTF. Naughty, but a one-off.

I'm quite happy for those to be fixed. The main point is that parsing the
files does not fail, though the content for those fields would (or at least
should) contain a Unicode replacement char.

> > You suggested previously that we should add a warning for the cases where an
> > isolated latin-1 char in someone's name turns out to be valid UTF-8 (but
> > encoding for an unexpected char). I think that's a good idea. Obviously that'd
> > want to be a non-fatal warning. Hmm, I now can't find the note where you made
> > that suggestion. Can you give more details on how that check would work
> > exactly?
>
> The common case is ASCII char, non-ASCII char, ASCII char. That's not a
> valid UTF-8 sequence, but fromUTF is erroneously accepting it. It needs
> to tighten up to keep these errors out.

Hmm. I'll replace the UTF decoder with the one from the utf8-string package
(which is also BSD licensed).

> Incidentally, a UTF decoder is also supposed to reject non-minimal
> encodings, e.g. a 3-byte encoding for a Char that can be encoded in
> 2 bytes. That's to force canonical encodings for security.

I believe the utf8-string version does that correctly. It detects over-long
encodings specifically and makes them an invalid char. As I understand it
that's so that it generates a single replacement char for non-minimal
encodings rather than several.

Duncan
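
To make the behaviour discussed above concrete, here is a rough sketch of a
strict decoder. It is not the utf8-string code and not Cabal's fromUTF; the
names decodeUtf8Strict, hasMalformed and replacementChar are made up for
illustration. It assumes the input is a plain list of bytes, rejects
truncated sequences (the "ASCII, non-ASCII, ASCII" Latin-1 case), and turns
each over-long encoding into a single replacement char.

```haskell
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical name: the char substituted for every malformed sequence.
replacementChar :: Char
replacementChar = '\xFFFD'

-- Hypothetical sketch of a strict UTF-8 decoder. Decoding never fails;
-- malformed input shows up as U+FFFD in the result.
decodeUtf8Strict :: [Word8] -> String
decodeUtf8Strict [] = []
decodeUtf8Strict (b:bs)
  | b < 0x80  = chr (fromIntegral b) : decodeUtf8Strict bs
  | b < 0xC0  = replacementChar : decodeUtf8Strict bs   -- stray continuation byte
  | b < 0xE0  = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b < 0xF0  = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b < 0xF8  = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise = replacementChar : decodeUtf8Strict bs   -- not a valid lead byte
  where
    isCont :: Word8 -> Bool
    isCont c = c .&. 0xC0 == 0x80

    -- n: continuation bytes expected; lead: payload bits of the lead byte;
    -- minCode: smallest code point that genuinely needs this length.
    multi :: Int -> Int -> Int -> String
    multi n lead minCode
      | length conts < n = replacementChar : decodeUtf8Strict (drop (length conts) bs)
      | invalid          = replacementChar : decodeUtf8Strict (drop n bs)
      | otherwise        = chr code : decodeUtf8Strict (drop n bs)
      where
        conts   = takeWhile isCont (take n bs)
        code    = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) lead conts
        invalid = code < minCode                       -- over-long (non-minimal) encoding
               || (code >= 0xD800 && code <= 0xDFFF)   -- surrogate range
               || code > 0x10FFFF                      -- beyond Unicode

-- Basis for a non-fatal warning. Note it also fires if the input
-- legitimately encodes U+FFFD itself.
hasMalformed :: [Word8] -> Bool
hasMalformed = elem replacementChar . decodeUtf8Strict
```

With this sketch the Latin-1 bytes for "José " (4A 6F 73 E9 20) decode with
one replacement char in place of the E9, while the over-long pair C0 AF
yields exactly one replacement char rather than two.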
