RE: detecting encoding in plain text (related to utf8)

2004-01-14 Thread Deepak Chand Rathore
Hi all, Great to hear so many views on detecting encoding. I would also like to share something related to detecting UTF-8 encoding. As most of you know, we can check any stream of bytes for UTF-8 encoding if any of the following sequences of bytes appears. If not, we simply

Re: Detecting encoding in Plain text

2004-01-14 Thread D. Starner
Peter Kirk writes: I agree that heuristics should be adjusted for Thai. But problems may arise if they have to be adjusted individually, and without regression errors, for all 6000+ world languages. Thai is hard because of the writing system. But most writing systems weren't encoded pre-Unicode,

Re: Detecting encoding in Plain text

2004-01-14 Thread D. Starner
- Original Message - From: Peter Kirk [EMAIL PROTECTED] Date: Tue, 13 Jan 2004 09:03:48 -0800 To: Doug Ewell [EMAIL PROTECTED] Subject: Re: Detecting encoding in Plain text On 13/01/2004 08:34, Doug Ewell wrote: Peter Kirk peterkirk at qaya dot org wrote: If a certain Unicode plain

Re: detecting encoding in plain text (related to utf8)

2004-01-14 Thread Doug Ewell
Deepak Chand Rathore deepakr at aztec dot soft dot net wrote: But there is one concern. In some cases the UTF-8 byte stream starts with a BOM (e.g. when we try reading bytes from a text file that is saved using Notepad (using the UTF-8 option) in Win2k, after the first few bytes (I suppose the first 3
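
Editorial sketch for the BOM concern raised above: a minimal Python example, assuming the input is otherwise valid UTF-8. The helper names `has_utf8_bom` and `decode_utf8_text` are hypothetical and not from the thread; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library. The UTF-8 BOM is the three bytes EF BB BF.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    return data.startswith(codecs.BOM_UTF8)

def decode_utf8_text(data: bytes) -> str:
    # The "utf-8-sig" codec decodes UTF-8 and drops one leading BOM if present,
    # so files saved with or without a BOM decode to the same text.
    return data.decode("utf-8-sig")
```

With this approach the caller does not need to skip the first three bytes by hand before validating or decoding the rest of the stream.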

Re: Detecting encoding in Plain text

2004-01-14 Thread Peter Kirk
On 13/01/2004 18:05, D. Starner wrote: Peter Kirk writes: I agree that heuristics should be adjusted for Thai. But problems may arise if they have to be adjusted individually, and without regression errors, for all 6000+ world languages. Thai is hard because of the writing system. But

Re: Detecting encoding in Plain text

2004-01-14 Thread John Burger
Mark E. Shoulson wrote: If it's a heuristic we're after, then why split hairs and try to make all the rules ourselves? Get a big ol' mess of training data in as many languages as you can and hand it over to a class full of CS graduate students studying Machine Learning. Absolutely my

Re: Detecting encoding in Plain text

2004-01-14 Thread Peter Kirk
On 14/01/2004 07:16, John Burger wrote: ... By the way, I still don't quite understand what's special about Thai. Could someone elaborate? I mentioned Thai because it is the only language I know of which does not use SPACE, U+0020. It also has at least some of its own punctuation. So a Thai

Re: Detecting encoding in Plain text

2004-01-14 Thread Doug Ewell
John Burger john at mitre dot org wrote: By the way, I still don't quite understand what's special about Thai. Could someone elaborate? It was just a hypothetical example: Suppose there's some relatively obscure script, oh, I don't know, say Thai, that breaks these assumptions... There isn't

Re: detecting encoding in plain text (related to utf8)

2004-01-14 Thread Markus Scherer
Deepak Chand Rathore wrote: Unicode range : UTF-8 encoded bytes; U-00000000 - U-0000007F: 0xxxxxxx; U-00000080 - U-000007FF: 110xxxxx 10xxxxxx; U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx ... This table is not
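
Editorial sketch of how the lead-byte and continuation-byte patterns in the table quoted above can be checked: a simplified Python example, not a method given in the thread. The function name `looks_like_utf8` is hypothetical, and the four-byte row (11110xxx) is an addition not shown in the quoted excerpt. The check deliberately ignores some finer rules (overlong three- and four-byte forms, surrogates); for strict validation, Python's built-in `bytes.decode("utf-8")` is the safer choice.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte fits a UTF-8 lead/continuation pattern."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx: one-byte (ASCII) form
            trail = 0
        elif 0xC2 <= lead <= 0xDF:   # 110xxxxx: one continuation byte
            trail = 1                # (0xC0 and 0xC1 would be overlong)
        elif 0xE0 <= lead <= 0xEF:   # 1110xxxx: two continuation bytes
            trail = 2
        elif 0xF0 <= lead <= 0xF4:   # 11110xxx: three continuation bytes
            trail = 3
        else:                        # stray continuation or invalid lead byte
            return False
        chunk = data[i + 1 : i + 1 + trail]
        # Every continuation byte must look like 10xxxxxx, and none may be missing.
        if len(chunk) != trail or any((b & 0xC0) != 0x80 for b in chunk):
            return False
        i += 1 + trail
    return True
```

Pure ASCII input passes this check too, so a True result only means the bytes are consistent with UTF-8, not that no other encoding fits.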

Re: Detecting encoding in Plain text

2004-01-14 Thread Mark Davis
On 14/01/2004 07:16, John Burger wrote: ... By the way, I still don't quite understand what's special about Thai. Could someone elaborate? I mentioned Thai because it is the only language I know of which does not use SPACE, U+0020. It also has at least

Re: Detecting encoding in Plain text

2004-01-14 Thread Peter Kirk
On 14/01/2004 09:25, Mark Davis wrote: I'm not sure which one suggested heuristic method you are referring to, ... Basically the one that in UTF-16 there are likely to be many zero bytes in either odd or even positions. ... but you are bounding to conclusions. For example, one of the

Re: Detecting encoding in Plain text

2004-01-14 Thread Frank Yung-Fong Tang
On 14/01/2004 07:16, John Burger wrote: ... By the way, I still don't quite understand what's special about Thai. Could someone elaborate

Re: Detecting encoding in Plain text

2004-01-14 Thread Frank Yung-Fong Tang
Does Thai use CR and LF? Peter Kirk wrote on 1/14/2004, 8:12 AM: On 14/01/2004 07:16, John Burger wrote: ... By the way, I still don't quite understand what's special about Thai. Could someone elaborate? I mentioned Thai because it is the only language I know of which does

Re: Detecting encoding in Plain text

2004-01-14 Thread Frank Yung-Fong Tang
John Burger wrote on 1/14/2004, 7:16 AM: Mark E. Shoulson wrote: If it's a heuristic we're after, then why split hairs and try to make all the rules ourselves? Get a big ol' mess of training data in as many languages as you can and hand it over to a class full of CS graduate

Re: Detecting encoding in Plain text

2004-01-14 Thread Peter Kirk
On 14/01/2004 15:35, Frank Yung-Fong Tang wrote: Does Thai use CR and LF? I hadn't forgotten this, as you will find if you look back over the whole thread. I would assume that some plain text might actually use the Unicode recommended line and paragraph separator characters, rather than CR

RE: Detecting encoding in Plain text

2004-01-14 Thread Mike Ayers
Frank Yung-Fong Tang wrote: Does Thai use CR and LF? If it's in HTML, then, like every other language, it need not. /|/|ike

RE: Detecting encoding in Plain text

2004-01-13 Thread Marco Cimarosti
Peter Kirk wrote: This one also looks dangerous. What do you mean by dangerous? This is an heuristic algorithm, so it is only supposed to work always but only in some lucky cases. If lucky cases average to, say, 20% or less then it is a bad and useless algorithm; if they average to, say, 80% or

RE: Detecting encoding in Plain text

2004-01-13 Thread Marco Cimarosti
Jon Hanna wrote: False positives can be caused by the use of U+0000 (which is most often encoded as 0x00) which some applications do use in text files. I have never seen such a thing, can you give an example? I can't imagine any use for a NULL in a file apart from terminating records or strings

Re: Detecting encoding in Plain text

2004-01-13 Thread Peter Kirk
On 13/01/2004 02:40, Marco Cimarosti wrote: Peter Kirk wrote: This one also looks dangerous. What do you mean by dangerous? This is an heuristic algorithm, so it is only supposed to work always but only in some lucky cases. If lucky cases average to, say, 20% or less then it is a bad and

RE: Detecting encoding in Plain text

2004-01-13 Thread Marco Cimarosti
Peter Kirk wrote: What do you mean by dangerous? This is an heuristic algorithm, so it is only supposed to work always [...] (I meant: it is not supposed to work always) I would not consider an 80% algorithm to be very good - depending on the circumstances etc. But if for example 20% of my

Re: Detecting encoding in Plain text

2004-01-13 Thread Peter Kirk
On 13/01/2004 04:10, Marco Cimarosti wrote: ... In this case (as in most other similar cases), you should rather blame the people who send you e-mail without encoding declaration. I get plenty of them. But then I assume that they default to ASCII or Windows-1252. Is there in fact a formal

Re: Detecting encoding in Plain text

2004-01-13 Thread Doug Ewell
Peter Kirk peterkirk at qaya dot org wrote: If a certain Unicode plain text file uses ASCII punctuation OR spaces OR end-of-line characters, AND the file is not too short or has a very odd formatting, then the algorithm should work. True. But there may be certain languages (perhaps Thai?)

Re: Detecting encoding in Plain text

2004-01-13 Thread Peter Kirk
On 13/01/2004 08:34, Doug Ewell wrote: Peter Kirk peterkirk at qaya dot org wrote: If a certain Unicode plain text file uses ASCII punctuation OR spaces OR end-of-line characters, AND the file is not too short or has a very odd formatting, then the algorithm should work. True. But

Re: Detecting encoding in Plain text

2004-01-13 Thread Mark E. Shoulson
On 01/13/04 05:40, Marco Cimarosti wrote: Peter Kirk wrote: This one also looks dangerous. What do you mean by dangerous? This is an heuristic algorithm, so it is only supposed to work always but only in some lucky cases. If lucky cases average to, say, 20% or less then it is a bad and

Re: Detecting encoding in Plain text

2004-01-12 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED] In UTF-16 practically any sequence of bytes is valid, and since you can't assume you know the language, you can't employ distribution statistics. Twelve years ago, when most text was not Unicode and all Unicode text was UTF-16, Microsoft documentation

RE: Detecting encoding in Plain text

2004-01-12 Thread Marco Cimarosti
Doug Ewell wrote: In UTF-16 practically any sequence of bytes is valid, and since you can't assume you know the language, you can't employ distribution statistics. Twelve years ago, when most text was not Unicode and all Unicode text was UTF-16, Microsoft documentation suggested a heuristic

RE: Detecting encoding in Plain text

2004-01-12 Thread jon
Quoting Marco Cimarosti [EMAIL PROTECTED]: Doug Ewell wrote: In UTF-16 practically any sequence of bytes is valid, and since you can't assume you know the language, you can't employ distribution statistics. Twelve years ago, when most text was not Unicode and all Unicode text was

Re: Detecting encoding in Plain text

2004-01-12 Thread Peter Kirk
On 12/01/2004 03:09, Marco Cimarosti wrote: ... It is extremely unlikely that a text file encoded in any single- or multi-byte encoding (including UTF-8) would contain a zero byte, so the presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or UTF-32. Is it not dangerous to
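
Editorial sketch of the zero-byte heuristic discussed above: a rough Python example under stated assumptions, not the algorithm from the Microsoft documentation mentioned in the thread. The function name `guess_utf16_by_zero_bytes` and the 30% threshold are hypothetical choices made for illustration. The idea is that characters below U+0100 have a zero high byte in UTF-16, so zeros cluster at even offsets in big-endian data and at odd offsets in little-endian data.

```python
from typing import Optional

def guess_utf16_by_zero_bytes(data: bytes, threshold: float = 0.3) -> Optional[str]:
    """Guess UTF-16 byte order from where zero bytes fall, or return None."""
    pairs = len(data) // 2
    if pairs == 0:
        return None
    even_zeros = data[0 : pairs * 2 : 2].count(0)  # zeros at offsets 0, 2, 4, ...
    odd_zeros = data[1 : pairs * 2 : 2].count(0)   # zeros at offsets 1, 3, 5, ...
    if even_zeros / pairs >= threshold and even_zeros > odd_zeros:
        return "utf-16-be"  # high (zero) byte comes first
    if odd_zeros / pairs >= threshold and odd_zeros > even_zeros:
        return "utf-16-le"  # high (zero) byte comes second
    return None
```

The sketch also shows the weakness raised in the thread: UTF-16 text that contains no characters below U+0100 (for example a run of Thai letters with no ASCII spaces, punctuation or line ends) has no zero bytes at all, so the function returns None for it.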

Re: Detecting encoding in Plain text

2004-01-12 Thread Philippe Verdy
From: Peter Kirk [EMAIL PROTECTED] On 12/01/2004 03:09, Marco Cimarosti wrote: ... It is extremely unlikely that a text file encoded in any single- or multi-byte encoding (including UTF-8) would contain a zero byte, so the presence of zero bytes is a strong enough hint for UTF-16 (or

Re: Detecting encoding in Plain text

2004-01-12 Thread Mark Davis
a long way. Mark http://www.macchiato.com - Original Message - From: Doug Ewell To: Unicode Mailing List Cc: Brijesh Sharma Sent: Sun, 2004 Jan 11 21:48 Subject: Re: Detecting encoding in Plain

Re: Detecting encoding in Plain text

2004-01-12 Thread Doug Ewell
Marco Cimarosti marco dot cimarosti at essetre dot it wrote: In UTF-16 practically any sequence of bytes is valid, and since you can't assume you know the language, you can't employ distribution statistics. Twelve years ago, when most text was not Unicode and all Unicode text was UTF-16,

RE: Detecting encoding in Plain text

2004-01-12 Thread Tom Emerson
Perhaps a meta question is this: how often are you going to encounter unBOMed UTF-32 or UTF-16 text? It's pretty rare --- certainly I've never seen it during the development of our language/encoding identifier. Sure, it's an interesting thought problem, but it doesn't happen. And fortunately

Re: Detecting encoding in Plain text

2004-01-12 Thread Curtis Clark
on 2004-01-12 08:57 Tom Emerson wrote: You also have to deal with oddities of language: I tried one open source implementation of the Cavnar and Trenkel algorithm THAT CLAIMED THAT SHOUTED ENGLISH WAS ACTUALLY CZECH. SHOUTED AT CLOSE RANGE (~ 1 CM FROM THE EAR) AND WITH A CZECH ACCENT, IT SOUNDS

Re: Detecting encoding in Plain text

2004-01-11 Thread Doug Ewell
Brijesh Sharma bssharma at quark dot co dot in wrote: I am writing a small tool to get text from a txt file into an edit box. Now this txt file could be in any encoding, e.g. UTF-8, UTF-16, Mac Roman, Windows ANSI, Western (ISO-8859-1), JIS, Shift-JIS, etc. My problem is that I can distinguish between

Re: Detecting encoding in Plain text

2004-01-09 Thread Peter Jacobi
Katsuhiko Momoi wrote: The specific URL for our IUC 19 paper with an update note at the beginning is this: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html from said paper: cite [UTF8] is inactive [SJIS] is inactive [EUCJP] detector has confidence 0.95 [GB2312]

Re: Detecting encoding in Plain text

2004-01-09 Thread Katsuhiko Momoi
Peter Jacobi wrote: Katsuhiko Momoi wrote: The specific URL for our IUC 19 paper with an update note at the beginning is this: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html from said paper: cite [UTF8] is inactive [SJIS] is inactive [EUCJP] detector has confidence

Detecting encoding in Plain text

2004-01-08 Thread Brijesh Sharma
Hi All, I am new to Unicode. I am writing a small tool to get text from a txt file into an edit box. Now this txt file could be in any encoding, e.g. UTF-8, UTF-16, Mac Roman, Windows ANSI, Western (ISO-8859-1), JIS, Shift-JIS, etc. My problem is that I can distinguish between UTF-8 or UTF-16 using the BOM.
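
Editorial sketch of the BOM check the original poster describes: a minimal Python example, assuming the file's leading bytes are available as `data`. The function name `sniff_bom` and the table `_BOMS` are hypothetical; the `codecs.BOM_*` constants are from the Python standard library. UTF-32 is included as an addition beyond the question, because the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE) and so has to be tested first.

```python
import codecs
from typing import Optional, Tuple

# Longest BOMs first, so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A result of (None, 0) says nothing about the encoding; it only means that the legacy encodings in the question (Mac Roman, Windows ANSI, ISO-8859-1, JIS, Shift-JIS) and BOM-less Unicode files still have to be told apart by other means. Note also that a UTF-16 little-endian file whose first character is U+0000 would be misread as UTF-32 by this check.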

Re: Detecting encoding in Plain text

2004-01-08 Thread John Delacour
At 12:09 pm +0000 8/1/04, [EMAIL PROTECTED] wrote: There is no foolproof way of differentiating between some of the encodings. While UTF-16 or UTF-8 with a BOM (such files don't necessarily start with a BOM by the way) stand out as being unlikely to be in any other encoding, others are more

Re: Detecting encoding in Plain text

2004-01-08 Thread jon
I am writing a small tool to get text from a txt file into an edit box. Now this txt file could be in any encoding, e.g. UTF-8, UTF-16, Mac Roman, Windows ANSI, Western (ISO-8859-1), JIS, Shift-JIS, etc. My problem is that I can distinguish between UTF-8 or UTF-16 using the BOM. But how do I auto

Re: Detecting encoding in Plain text

2004-01-08 Thread D. Starner
Given any sizeable chunk of text, it ought to be possible to estimate the statistical likelihood of its being in a certain encoding/[language] even if it's in an unspecified 8859-* encoding. It would be quite an interesting exercise, but I'd be surprised if someone hasn't done it before.

Re: Detecting encoding in Plain text

2004-01-08 Thread Patrick Andries
- Message d'origine - De: John Delacour [EMAIL PROTECTED] Given any sizeable chunk of text, it ought to be possible to estimate the statistical likelihood of its being in a certain encoding/[language] even if it's in an unspecified 8859-* encoding. It would be quite an

Re: Detecting encoding in Plain text

2004-01-08 Thread Tex Texin
There were also papers on the subject at past unicode conferences. Look for one by Martin Duerst several years ago and one by Kat Momoi, Netscape only a few years back. I think both are on the web. Also look at the Netscape open source code. I believe it does some detection. However, accuracy

RE: Detecting encoding in Plain text

2004-01-08 Thread Chris Pratley
Hi All, I am new to Unicode. I am writing a small tool to get text from a txt file into an edit box. Now this txt file could be in any encoding, e.g. UTF-8, UTF-16, Mac Roman, Windows ANSI, Western (ISO-8859-1), JIS, Shift-JIS, etc. My problem is that I can

Re: Detecting encoding in Plain text

2004-01-08 Thread Jungshik Shin
On Thu, 8 Jan 2004, Tex Texin wrote: There were also papers on the subject at past unicode conferences. Look for one by Martin Duerst several years ago and one by Kat Momoi, Netscape only a few years back. I think both are on the web. Also look at the Netscape open source code. I believe it

Re: Detecting encoding in Plain text

2004-01-08 Thread Katsuhiko Momoi
Jungshik Shin wrote: On Thu, 8 Jan 2004, Tex Texin wrote: There were also papers on the subject at past unicode conferences. Look for one by Martin Duerst several years ago and one by Kat Momoi, Netscape only a few years back. I think both are on the web. Also look at the Netscape