RE: Detecting encoding in Plain text

2004-01-14 Thread Mike Ayers
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Frank Yung-Fong Tang


> Does Thai use CR and LF?


    If it's in HTML, then, like every other language, it need not.



/|/|ike





Re: Detecting encoding in Plain text

2004-01-14 Thread Peter Kirk
On 14/01/2004 15:35, Frank Yung-Fong Tang wrote:

Does Thai use CR and LF?

 

I hadn't forgotten this, as you will find if you look back over the 
whole thread. I would assume that some plain text might actually use the 
Unicode recommended line and paragraph separator characters, rather than 
CR and LF. Also some short text files might consist of one paragraph 
with no CR or LF. At least this could be true of English; you know a lot 
more Thai than I do!

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Detecting encoding in Plain text

2004-01-14 Thread Frank Yung-Fong Tang


John Burger wrote on 1/14/2004, 7:16 AM:

 > Mark E. Shoulson wrote:
 >
 > > If it's a heuristic we're after, then why split hairs and try to make
 > > all the rules ourselves?  Get a big ol' mess of training data in as
 > > many languages as you can and hand it over to a class full of CS
 > > graduate students studying Machine Learning.
 >
 > Absolutely my reaction.  All of these suggested heuristics are great,
 > but would almost certainly simply fall out of a more rigorous approach
 > using a generative probabilistic model, or some other classification
 > technique.  Useful features would include n-graphs frequencies, as Mark
 > suggests, as well as lots of other things.  For particular
 > applications, you could use a cache model, e.g., using statistics from
 > other documents from the same web site, or other messages from the same
 > email address, or even generalizing across country-of-origin.
 > Additionally, I'm pretty sure that you could get some mileage out of
 > unsupervised data, that is, all of the documents in the training set
 > needn't be labeled with language/encoding.  And one thing we have a lot
 > of on the web is unsupervised data.
 >
 > I would be extremely surprised if such an approach couldn't achieve 99%
 > accuracy - and I really do mean 99%, or better.
 >
 > By the way, I still don't quite understand what's special about Thai.
 > Could someone elaborate?


For languages other than Thai, Chinese and Japanese, you will usually see 
a space between words, so you should see a high count of SPACE in your 
document. For text in those languages, SPACE should occupy roughly 
10%-15% of the code points (just a guess: if the average word length is 
nine characters, you will get about 10% SPACE; if the average is shorter, 
the percentage of SPACE increases). But Thai, Chinese and Japanese do not 
put spaces between words, so the percentage of SPACE code points will be 
quite different. For Korean it is hard to say; it depends on whether 
IDEOGRAPHIC SPACE or the single-byte SPACE is used, and also on which 
normalization form is used. The percentage of SPACE will differ because 
in one normalization form a Korean syllable counts as one Unicode code 
point, but in the decomposed form it may count as three.
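
A minimal Python sketch of this space-frequency test (my illustration; 
the function names and the 5% threshold are assumptions, not measured 
figures):

    def space_ratio(text: str) -> float:
        # Fraction of code points in the text that are U+0020 SPACE.
        return text.count("\u0020") / len(text) if text else 0.0

    def probably_space_delimited(text: str, threshold: float = 0.05) -> bool:
        # Space-delimited languages tend to show roughly 10-15% SPACE;
        # Thai, Chinese and Japanese running text usually shows far less.
        return space_ratio(text) >= threshold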

Shanjian Lee and Kat Momoi implemented a charset detector based on my 
early work and direction. They summarized it in a paper presented on 
September 11, 2001; see 
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html for 
details. It talks about a different set of issues and problems.


 >
 > - John Burger
 >MITRE
 >
 >
 >





Re: Detecting encoding in Plain text

2004-01-14 Thread Frank Yung-Fong Tang
Does Thai use CR and LF?

Peter Kirk wrote on 1/14/2004, 8:12 AM:

 > On 14/01/2004 07:16, John Burger wrote:
 >
 > > ...
 > > By the way, I still don't quite understand what's special about Thai.
 > > Could someone elaborate?
 > >
 > I mentioned Thai because it is the only language I know of which does
 > not use SPACE, U+0020. It also has at least some of its own
 > punctuation. So a Thai text need not include any characters U+00xx -
 > which rules out one suggested heuristic method.
 >
 > --
 > Peter Kirk
 > [EMAIL PROTECTED] (personal)
 > [EMAIL PROTECTED] (work)
 > http://www.qaya.org/
 >
 >
 >





Re: Detecting encoding in Plain text

2004-01-14 Thread Frank Yung-Fong Tang
Consider CR and LF too.

Mark Davis wrote on 1/14/2004, 9:25 AM:

 > I'm not sure which "one suggested heuristic method" you are referring
 > to, but
 > you are bounding to conclusions. For example, one of the heuristics is
 > to judge
 > what are more common characters when bytes are interpreted as if they
 > were in
 > different encoding schemes. When picking between UTF16-BE and LE,
 > U+0020 is
 > *still* much more common than U+2000, even in Thai.
 >
 > Mark
 > __
 > http://www.macchiato.com
 >
 > - Original Message -
 > From: "Peter Kirk" <[EMAIL PROTECTED]>
 > To: "John Burger" <[EMAIL PROTECTED]>
 > Cc: <[EMAIL PROTECTED]>
 > Sent: Wed, 2004 Jan 14 08:12
 > Subject: Re: Detecting encoding in Plain text
 >
 >
 > > On 14/01/2004 07:16, John Burger wrote:
 > >
 > > > ...
 > > > By the way, I still don't quite understand what's special about Thai.
 > > > Could someone elaborate?
 > > >
 > > I mentioned Thai because it is the only language I know of which does
 > not use SPACE, U+0020. It also has at least some of its own
 > > punctuation. So a Thai text need not include any characters U+00xx -
 > > which rules out one suggested heuristic method.
 > >
 > > --
 > > Peter Kirk
 > > [EMAIL PROTECTED] (personal)
 > > [EMAIL PROTECTED] (work)
 > > http://www.qaya.org/
 > >
 > >
 > >
 > >
 >
 >





Re: Detecting encoding in Plain text

2004-01-14 Thread Peter Kirk
On 14/01/2004 09:25, Mark Davis wrote:

I'm not sure which "one suggested heuristic method" you are referring to, ...

Basically the one that in UTF-16 there are likely to be many zero bytes 
in either odd or even positions.

... but
you are bounding to conclusions. For example, one of the heuristics is to judge
what are more common characters when bytes are interpreted as if they were in
different encoding schemes. When picking between UTF16-BE and LE, U+0020 is
*still* much more common than U+2000, even in Thai.
 

Not necessarily. In certain texts neither might occur at all, so the 
heuristic fails.

I agree with Mark S and others that more sophisticated methods are 
likely to be safer.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Detecting encoding in Plain text

2004-01-14 Thread Mark Davis
I'm not sure which "one suggested heuristic method" you are referring to, but
you are bounding to conclusions. For example, one of the heuristics is to judge
what are more common characters when bytes are interpreted as if they were in
different encoding schemes. When picking between UTF16-BE and LE, U+0020 is
*still* much more common than U+2000, even in Thai.

Mark
__
http://www.macchiato.com

- Original Message - 
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "John Burger" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wed, 2004 Jan 14 08:12
Subject: Re: Detecting encoding in Plain text


> On 14/01/2004 07:16, John Burger wrote:
>
> > ...
> > By the way, I still don't quite understand what's special about Thai.
> > Could someone elaborate?
> >
> I mentioned Thai because it is the only language I know of which does
> not use SPACE, U+0020. It also has at least some of its own
> punctuation. So a Thai text need not include any characters U+00xx -
> which rules out one suggested heuristic method.
>
> -- 
> Peter Kirk
> [EMAIL PROTECTED] (personal)
> [EMAIL PROTECTED] (work)
> http://www.qaya.org/
>
>
>
>




Re: detecting encoding in plain text (related to utf8)

2004-01-14 Thread Markus Scherer
Deepak Chand Rathore wrote:
unicode range                utf-8 encoded bytes
U-00000000 - U-0000007F:     0xxxxxxx
U-00000080 - U-000007FF:     110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:     1110xxxx 10xxxxxx 10xxxxxx
> ...

This table is not correct. Please check Unicode 3.2 or Unicode 4 for the correct table.

Table 3.1B. Legal UTF-8 Byte Sequences in 
http://www.unicode.org/reports/tr28/#3_1_conformance
Conformance chapter in http://www.unicode.org/versions/Unicode4.0.0/

But, there is one concern. In some cases the utf8 byte stream starts with a
BOM,( for eg. when we try reading bytes from a text file that
is saved using notepad (using utf8 option )in WIN2k, after first few bytes(
i suppose first 3 bytes), the actual text start.
So how do we detect whether the byte stream starts with a BOM or not ??
or the first few bytes represent BOM or the actual text ??
There is a whole FAQ section on this topic at http://www.unicode.org/faq/utf_bom.html#BOM

Best regards,
markus
--
Opinions expressed here may not reflect my company's positions unless otherwise noted.



Re: Detecting encoding in Plain text

2004-01-14 Thread Doug Ewell
John Burger  wrote:

> By the way, I still don't quite understand what's special about Thai.
> Could someone elaborate?

It was just a hypothetical example: "Suppose there's some relatively
obscure script, oh, I don't know, say Thai, that breaks these
assumptions..."  There isn't any evidence that Thai actually does break
them; it was just a what-if.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Detecting encoding in Plain text

2004-01-14 Thread Peter Kirk
On 14/01/2004 07:16, John Burger wrote:

...
By the way, I still don't quite understand what's special about Thai.  
Could someone elaborate?

I mentioned Thai because it is the only language I know of which does 
not used SPACE, U+0020. It also has at least some of its own 
punctuation. So a Thai text need not include any characters U+00xx - 
which rules out one suggested heuristic method.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Detecting encoding in Plain text

2004-01-14 Thread John Burger
Mark E. Shoulson wrote:

If it's a heuristic we're after, then why split hairs and try to make 
all the rules ourselves?  Get a big ol' mess of training data in as 
many languages as you can and hand it over to a class full of CS 
graduate students studying Machine Learning.
Absolutely my reaction.  All of these suggested heuristics are great, 
but would almost certainly simply fall out of a more rigorous approach 
using a generative probabilistic model, or some other classification 
technique.  Useful features would include n-graphs frequencies, as Mark 
suggests, as well as lots of other things.  For particular 
applications, you could use a cache model, e.g., using statistics from 
other documents from the same web site, or other messages from the same 
email address, or even generalizing across country-of-origin.  
Additionally, I'm pretty sure that you could get some mileage out of 
unsupervised data, that is, all of the documents in the training set 
needn't be labeled with language/encoding.  And one thing we have a lot 
of on the web is unsupervised data.

I would be extremely surprised if such an approach couldn't achieve 99% 
accuracy - and I really do mean 99%, or better.

By the way, I still don't quite understand what's special about Thai.  
Could someone elaborate?

- John Burger
  MITRE




Re: Detecting encoding in Plain text

2004-01-14 Thread Peter Kirk
On 13/01/2004 18:05, D. Starner wrote:

Peter Kirk writes:
 

I agree that heuristics should be adjusted for Thai. But problems may 
arise if they have to be adjusted individually, and without regression 
errors, for all 6000+ world languages.
   

Thai is hard because of the writing system. But most writing systems weren't
encoded pre-Unicode, so if they were typed into a computer, it was with
a Latin (or Cyrillic?) transliteration that probably used spaces and new lines,
and in fact was probably ASCII. 

More cynically, those who use obscure character sets or font encodings have 
trouble viewing them; that is one of the reasons for Unicode. That this tool 
may to some extent be an example of that problem is a simple fact of life, 
and doesn't call for it to be thrown out.
 

Either you are confused or I am. I was not referring to pre-Unicode 
legacy encodings. I was referring to Unicode plain text data which may 
(when Unicode includes all the necessary characters) be in any one of 
6000+ languages, some of which have a variety of scripts and spelling 
conventions. The problem is not that people are using obscure legacy 
encodings, but that they are not defining their UTF adequately.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: detecting encoding in plain text (related to utf8)

2004-01-14 Thread Doug Ewell
Deepak Chand Rathore  wrote:

> But, there is one concern. In some cases the utf8 byte stream starts
> with a BOM,( for eg. when we try reading bytes from a text file that
> is saved using notepad (using utf8 option )in WIN2k, after first few
> bytes( i suppose first 3 bytes), the actual text start.
> So how do we detect whether the byte stream starts with a BOM or
> not ??
> or the first few bytes represent BOM or the actual text ??

What you are asking is, if a UTF-8 byte stream starts with the character
U+FEFF, should that character be treated as a signature (BOM) or as a
zero-width no-break space?

You'll probably get different responses to this, having to do with
tagging or streams broken in the middle.  My view is that a zero-width
no-break space has *no business* appearing at the start of a text
stream.  With no character to precede it, what would it prevent a break
between?  U+FEFF, or specifically the bytes EF BB BF, at the true start
of a UTF-8 stream should always be interpreted as a signature.
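
A minimal Python sketch of that rule (the function name is mine, purely
illustrative):

    UTF8_SIG = b"\xef\xbb\xbf"

    def strip_utf8_signature(data: bytes) -> bytes:
        # EF BB BF at the very start is a signature (BOM); strip it.
        # Anywhere else, the same bytes decode as U+FEFF ZWNBSP.
        return data[len(UTF8_SIG):] if data.startswith(UTF8_SIG) else data

    # strip_utf8_signature(b"\xef\xbb\xbfhello") == b"hello"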

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
 I don't speak for the Unicode Consortium.




Re: Detecting encoding in Plain text

2004-01-14 Thread D. Starner
- Original Message -
From: Peter Kirk <[EMAIL PROTECTED]>
Date: Tue, 13 Jan 2004 09:03:48 -0800
To: Doug Ewell <[EMAIL PROTECTED]>
Subject: Re: Detecting encoding in Plain text
On 13/01/2004 08:34, Doug Ewell wrote:

>Peter Kirk  wrote:
>
>  
>
>>>If a certain Unicode plain text file uses ASCII punctuation OR spaces
>>>OR end-of-line characters, AND the file is not too short or has a
>>>very odd formatting, then the algorithm should work.
>>>  
>>>
>>True. But there may be certain languages (perhaps Thai?) for which all
>>of these circumstances regularly occur together. It would be very
>>inconvenient for users of these languages if programs regularly
>>attribute the wrong encoding to their text.
>>
>>
>
>Whether this is specifically true for Thai or not -- and I doubt that
>the "short file or odd formatting" condition could ever be considered
>language-dependent -- I would say an otherwise-good heuristic that
>performs badly for Thai ought to have special cases built in for Thai,
>rather than being discarded.
>
>
>  
>
I may have confused you with what I wrote,  but my "all of these 
circumstances" referred not to "the "short file or odd formatting" 
condition", but to Marco's "*all* these circumstances", which you 
snipped, which were originally:

>Some scripts include their own digits and punctuation; not all scripts use spaces; 
and controls are not necessarily used, if U+2028 LINE SEPARATOR is used for new lines.
>
I agree that heuristics should be adjusted for Thai. But problems may 
arise if they have to be adjusted individually, and without regression 
errors, for all 6000+ world languages.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/





Re: Detecting encoding in Plain text

2004-01-14 Thread D. Starner
Peter Kirk writes:
I agree that heuristics should be adjusted for Thai. But problems may 
arise if they have to be adjusted individually, and without regression 
errors, for all 6000+ world languages.
Thai is hard because of the writing system. But most writing systems weren't
encoded pre-Unicode, so if they were typed into a computer, it was with
a Latin (or Cyrillic?) transliteration that probably used spaces and new lines,
and in fact was probably ASCII. 

More cynically, those who use obscure character sets or font encodings have 
trouble viewing them; that is one of the reasons for Unicode. That this tool 
may to some extent be an example of that problem is a simple fact of life, 
and doesn't call for it to be thrown out.

[If a reply to this message with no content appeared, I'm sorry. I hit the
enter key in the wrong place and off it went.]



RE: detecting encoding in plain text (related to utf8)

2004-01-14 Thread Deepak Chand Rathore


Hi all,

Great to hear so many views on detecting encodings.
I would also like to share something related to detecting UTF-8.
As most of you know, we can check whether a stream of bytes is UTF-8 by
checking that it contains only the following byte sequences; if it does
not, we simply consider it not to be UTF-8.

unicode range                utf-8 encoded bytes
U-00000000 - U-0000007F:     0xxxxxxx
U-00000080 - U-000007FF:     110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:     1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF:     111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF:     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Similarly, using the above principle, we can write our own functions that
convert wide characters to UTF-8 and vice versa.
As I understand it, this will work (am I right??).
This approach will surely help because we don't have to rely on the library
(for example, some UTF-8 functions require the locale to be set to an
xxx.UTF-8 locale, which creates a dependency on that locale).
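
For illustration, such a hand-rolled converter might look like the
following Python sketch (following the bit patterns tabulated above, but
limited to the range U+0000..U+10FFFF that Unicode actually allows; a
stricter version would also reject the surrogate range U+D800..U+DFFF):

    def encode_utf8(cp: int) -> bytes:
        # Encode one code point using the table's bit patterns.
        if cp < 0 or cp > 0x10FFFF:
            raise ValueError("code point out of Unicode range")
        if cp <= 0x7F:                       # 0xxxxxxx
            return bytes([cp])
        if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        return bytes([0xF0 | (cp >> 18),     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

    # encode_utf8(0x20AC) == b"\xe2\x82\xac"  (EURO SIGN)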

But there is one concern. In some cases the UTF-8 byte stream starts with
a BOM; for example, when we read bytes from a text file saved with Notepad
(using the UTF-8 option) on Windows 2000, the actual text starts after the
first few bytes (the first 3 bytes, I believe).
So how do we detect whether the byte stream starts with a BOM or not,
i.e. whether the first few bytes represent a BOM or actual text?

with regards
( DC )
deepak chand rathore 



Re: Detecting encoding in Plain text

2004-01-13 Thread Mark E. Shoulson
On 01/13/04 05:40, Marco Cimarosti wrote:

Peter Kirk wrote:
 

This one also looks dangerous.
   

What do you mean by "dangerous"? This is an heuristic algorithm, so it is
only supposed to work always but only in some lucky cases.
If lucky cases average to, say, 20% or less then it is a bad and useless
algorithm; if they average to, say, 80% or more, then it is good and
useful. But you can't ask that it works in 100% of cases, or it
wouldn't be heuristic anymore.
 

If it's a heuristic we're after, then why split hairs and try to make 
all the rules ourselves?  Get a big ol' mess of training data in as many 
languages as you can and hand it over to a class full of CS graduate 
students studying Machine Learning.  Throw it at some neural networks, 
go Bayesian with digraphs, whatever.  Analyzing multigraph frequency 
(say, strings of up to four characters) would probably do a pretty 
decent job just by itself.
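
A toy Python sketch of the digraph-frequency idea (purely illustrative;
real detectors train on far more data and use better models):

    from collections import Counter

    def digraph_profile(sample: bytes) -> Counter:
        # Relative frequency of each adjacent byte pair in labeled sample text.
        counts = Counter(zip(sample, sample[1:]))
        total = sum(counts.values()) or 1
        return Counter({pair: n / total for pair, n in counts.items()})

    def score(unknown: bytes, profile: Counter) -> float:
        # Higher is better: how familiar the unknown text's digraphs look.
        return sum(profile.get(pair, 0.0) for pair in zip(unknown, unknown[1:]))

    def classify(unknown: bytes, profiles: dict) -> str:
        return max(profiles, key=lambda name: score(unknown, profiles[name]))

    # profiles = {"english/iso-8859-1": digraph_profile(english_sample), ...}
    # classify(mystery_bytes, profiles)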

~mark




Re: Detecting encoding in Plain text

2004-01-13 Thread Peter Kirk
On 13/01/2004 08:34, Doug Ewell wrote:

Peter Kirk  wrote:

 

If a certain Unicode plain text file uses ASCII punctuation OR spaces
OR end-of-line characters, AND the file is not too short or has a
very odd formatting, then the algorithm should work.
 

True. But there may be certain languages (perhaps Thai?) for which all
of these circumstances regularly occur together. It would be very
inconvenient for users of these languages if programs regularly
attribute the wrong encoding to their text.
   

Whether this is specifically true for Thai or not -- and I doubt that
the "short file or odd formatting" condition could ever be considered
language-dependent -- I would say an otherwise-good heuristic that
performs badly for Thai ought to have special cases built in for Thai,
rather than being discarded.
 

I may have confused you with what I wrote,  but my "all of these 
circumstances" referred not to "the "short file or odd formatting" 
condition", but to Marco's "*all* these circumstances", which you 
snipped, which were originally:

Some scripts include their own digits and punctuation; not all scripts use spaces; and controls are not necessarily used, if U+2028 LINE SEPARATOR is used for new lines.

I agree that heuristics should be adjusted for Thai. But problems may 
arise if they have to be adjusted individually, and without regression 
errors, for all 6000+ world languages.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Detecting encoding in Plain text

2004-01-13 Thread Doug Ewell
Peter Kirk  wrote:

>> If a certain Unicode plain text file uses ASCII punctuation OR spaces
>> OR end-of-line characters, AND the file is not too short or has a
>> very odd formatting, then the algorithm should work.
>
> True. But there may be certain languages (perhaps Thai?) for which all
> of these circumstances regularly occur together. It would be very
> inconvenient for users of these languages if programs regularly
> attribute the wrong encoding to their text.

Whether this is specifically true for Thai or not -- and I doubt that
the "short file or odd formatting" condition could ever be considered
language-dependent -- I would say an otherwise-good heuristic that
performs badly for Thai ought to have special cases built in for Thai,
rather than being discarded.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Detecting encoding in Plain text

2004-01-13 Thread Peter Kirk
On 13/01/2004 04:10, Marco Cimarosti wrote:

...

In this case (as in most other similar cases), you should rather blame the
people who send you e-mail without encoding declaration.
 

I get plenty of them. But then I assume that they default to ASCII or 
Windows-1252. Is there in fact a formal default for e-mail, HTML etc 
without encoding declaration?

...

I don't think that Thai would be such a case. Thai normally uses European
digits (the usage scope of Thai digits is probably similar to that of Roman
numerals in Western languages), some European punctuation (parentheses,
exclamation marks, hyphens, quotes), and spaces (although a Thai space has
the strength -- and hence the frequency -- of a Western semicolon).
 

In some English texts the combined frequency of digits, parentheses, 
exclamation marks, quotes and semicolons is minimal, so perhaps 
similarly for their Thai counterparts. Does Thai use the basic Latin 
hyphen as part of the spelling of common words? Apart from them there is 
no guarantee that any basic Latin characters will be used.

As a minimum, all languages should use line feed and/or new line as line
terminators, as Unicode's line and paragraph separators never caught on.
 

Yes, but has it caught on in some countries/languages/applications/OSs? 
And will it catch on in future? Anyway, some texts use very long 
paragraphs and so very few explicit line feeds etc.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




RE: Detecting encoding in Plain text

2004-01-13 Thread Marco Cimarosti
Peter Kirk wrote:
> >What do you mean by "dangerous"? This is an heuristic
> >algorithm, so it is only supposed to work always [...]

(I meant: "it is not supposed to work always")

> I would not consider an 80% algorithm to be very good - 
> depending on the circumstances etc. But if for example 20% of
> my incoming e-mails were detected with the wrong encoding and
> appeared on my screen as junk, [...]

In this case (as in most other similar cases), you should rather blame the
people who send you e-mail without encoding declaration.

Auto-detection should be the last resort, when you have no safest way of
determining the encoding.

> >Yes, but *all* these circumstances must occur together in 
> >order for the algorithm to be totally useless for *that*
> >language. [...]
> >
> True. But there may be certain languages (perhaps Thai?) for 
> which all of these circumstances regularly occur together.

I don't think that Thai would be such a case. Thai normally uses European
digits (the usage scope of Thai digits is probably similar to that of Roman
numerals in Western languages), some European punctuation (parentheses,
exclamation marks, hyphens, quotes), and spaces (although a Thai space has
the strength -- and hence the frequency -- of a Western semicolon).

As a minimum, all languages should use line feed and/or new line as line
terminators, as Unicode's line and paragraph separators never caught on.

_ Marco




Re: Detecting encoding in Plain text

2004-01-13 Thread Peter Kirk
On 13/01/2004 02:40, Marco Cimarosti wrote:

Peter Kirk wrote:
 

This one also looks dangerous.
   

What do you mean by "dangerous"? This is an heuristic algorithm, so it is
only supposed to work always but only in some lucky cases.
If lucky cases average to, say, 20% or less then it is a bad and useless
algorithm; if they average to, say, 80% or more, then it is good and
useful. But you can't ask that it works in 100% of cases, or it
wouldn't be heuristic anymore.
 

I would not consider an 80% algorithm to be very good - depending on the 
circumstances etc. But if for example 20% of my incoming e-mails were 
detected with the wrong encoding and appeared on my screen as junk, and 
I had to manually adjust the encoding, I would not be very happy. I 
would probably prefer a manual selection method e.g. from a list.

Some scripts include their own 
digits and punctuation; not all scripts use spaces; and controls are not 
necessarily used, if U+2028 LINE SEPARATOR is used for new lines.
   

Yes, but *all* these circumstances must occur together in order for the
algorithm to be totally useless for *that* language.
If a certain Unicode plain text file uses ASCII punctuation OR spaces OR
end-of-line characters, AND the file is not too short or has a very odd
formatting, then the algorithm should work.
 

True. But there may be certain languages (perhaps Thai?) for which all 
of these circumstances regularly occur together. It would be very 
inconvenient for users of these languages if programs regularly 
attribute the wrong encoding to their text.

But there may be some characters U+??00 which are used rather 
commonly in a particular script and so occur commonly in
some text files.
   

And those text files will not be detected correctly, particularly if they
are very short: that's part of the game.
 

Even if they are very long, if they don't use Latin-1 at all, as above.
At least this shouldn't be a problem for Thai, as U+0E00 is not used.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




RE: Detecting encoding in Plain text

2004-01-13 Thread Marco Cimarosti
Jon Hanna wrote:
> False positives can be caused by the use of U+ (which is 
> most often encoded as 0x00) which some applications do use
> in text files.

I have never seen such a thing; can you give an example?

I can't imagine any use for a NULL in a file apart from terminating records
or strings but, of course, a file containing records or strings is not what
I would call a "plain-text file", or at least not a "typical" plain-text file.

> The method can be used reliably with text files that are 
> guaranteed to contain large amounts of Latin-1

But the Latin-1 (or even just ASCII) range contains some characters which
are shared by most languages (space, new line and/or line feed, digits,
punctuation), so there should be a relatively large amount of Latin-1
characters in most cases.

Even scripts which have their own digits or punctuation often prefer
European digits and punctuation, especially in computer usage. E.g., it
suffices to check a few websites (or even printed matter) in Arabic to see
that European digits are much more widespread than native digits.

_ Marco



RE: Detecting encoding in Plain text

2004-01-13 Thread Marco Cimarosti
Peter Kirk wrote:
> This one also looks dangerous.

What do you mean by "dangerous"? This is an heuristic algorithm, so it is
only supposed to work always but only in some lucky cases.

If lucky cases average to, say, 20% or less then it is a bad and useless
algorithm; if they average to, say, 80% or more, then it is good and
useful. But you can't ask that it works in 100% of cases, or it
wouldn't be heuristic anymore.

> Some scripts include their own 
> digits and punctuation; not all scripts use spaces; and controls are not 
> necessarily used, if U+2028 LINE SEPARATOR is used for new lines.

Yes, but *all* these circumstances must occur together in order for the
algorithm to be totally useless for *that* language.

If a certain Unicode plain text file uses ASCII punctuation OR spaces OR
end-of-line characters, AND the file is not too short or has a very odd
formatting, then the algorithm should work.

> But there may be some characters U+??00 which are used rather 
> commonly in a particular script and so occur commonly in
> some text files.

And those text files will not be detected correctly, particularly if they
are very short: that's part of the game.

_ Marco



Re: Detecting encoding in Plain text

2004-01-12 Thread Curtis Clark
on 2004-01-12 08:57 Tom Emerson wrote:
You also have to deal with oddities of language: I tried one open
source implementation of the Cavnar and Trenkel algorithm THAT CLAIMED
THAT SHOUTED ENGLISH WAS ACTUALLY CZECH.
SHOUTED AT CLOSE RANGE (~ 1 CM FROM THE EAR) AND WITH A CZECH ACCENT, IT 
SOUNDS PRETTY MUCH THE SAME.

--
Curtis Clark  http://www.csupomona.edu/~jcclark/
Mockingbird Font Works  http://www.mockfont.com/


RE: Detecting encoding in Plain text

2004-01-12 Thread Tom Emerson
Perhaps a meta question is this: how often are you going to encounter
unBOMed UTF-32 or UTF-16 text? It's pretty rare --- certainly I've never
seen it during the development of our language/encoding identifier.

Sure, it's an interesting thought problem, but it doesn't happen.
And fortunately detecting UTF-8 is relatively easy.

The real problem is differentiating between the ISO 8859-x family and
EUC-CN vs. EUC-KR. These are wonderfully ambiguous.

The key to doing this right is having _a_lot_ of valid training data.
You also have to deal with oddities of language: I tried one open
source implementation of the Cavnar and Trenkel algorithm THAT CLAIMED
THAT SHOUTED ENGLISH WAS ACTUALLY CZECH.

It's difficult to separate the language detection from the encoding
detection when dealing with non-Unicode text.

-tree  

--
Tom Emerson  Basis Technology Corp.
Software Architect http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever" 



Re: Detecting encoding in Plain text

2004-01-12 Thread Doug Ewell
Marco Cimarosti  wrote:

>> In UTF-16 practically any sequence of bytes is valid, and since you
>> can't assume you know the language, you can't employ distribution
>> statistics.  Twelve years ago, when most text was not Unicode and all
>> Unicode text was UTF-16, Microsoft documentation suggested a
>> heuristic of checking every other byte to see if it was zero, which
>> of course would only work for Latin-1 text encoded in UTF-16.
>
> I beg to differ. IMHO, analyzing zero bytes is a viable method for detecting
> BOM-less UTF-16 and UTF-32. BTW, I didn't know (and I don't quite
> care) that this method was suggested first by Microsoft: to me, it
> seems quite self-evident.

I was referring specifically to the technique of checking every other
byte for zero, not checking whether there were zeros at all.  Certainly,
if your only choices are UTF-16 and encodings that do not use zero
bytes, the first zero byte answers the question.

This is a specific case of the "ME state" described by Li and Momoi: a
byte sequence (0x00) is found which could only be in one of the possible
encodings.

> It is extremely unlikely that a text file encoded in any single- or
> multi-byte encoding (including UTF-8) would contain a zero byte, so
> the presence of zero bytes is a strong enough hint for UTF-16 (or
> UCS-2) or UTF-32.

Jon Hanna and Peter Kirk responded that U+0000 could occur in specific
types of text files used by certain applications, or in markup formats.
But it seems reasonable that in such cases, the process reading the file
would already know what format to expect.

> Of course, all this works only if the basic assumption that the file is
> a plain text file is true: this method is not quite enough for telling
> text files apart from binary files.

Of course.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





Re: Detecting encoding in Plain text

2004-01-12 Thread Mark Davis
One thing I have done in the past that was along similar lines:

If you know that it is a UTF, and if you know that you support the latest
version of Unicode, then you can walk through the bytes in 7 parallel paths,
with each fetching a code point in each of the 7 encoding schemes and testing
it. If you hit an illegal sequence or unassigned code point, then you 'turn off'
that path. If you have a single path at any point, then jump to a faster routine
to do the rest of the conversion. (I actually had 8 paths, since I also could
have Latin-1.)

I never put in anything to settle the cases where you end up with more than one
path, except for a simple priority order. In those rare cases where necessary, I
suspect something simple like capturing the frequency of a some common
characters, such as new lines, space, and certain punctuation, and some uncommon
characters (most controls) would go a long way.
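
A rough Python sketch of the parallel-path idea, using strict incremental
decoders for a subset of the Unicode encoding schemes (checking for
unassigned code points, as described above, would additionally need
character property data):

    import codecs

    SCHEMES = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]

    def surviving_schemes(data: bytes) -> list:
        decoders = {name: codecs.getincrementaldecoder(name)("strict")
                    for name in SCHEMES}
        alive = set(SCHEMES)
        for i in range(0, len(data), 256):          # feed the bytes in chunks
            chunk = data[i:i + 256]
            for name in list(alive):
                try:
                    decoders[name].decode(chunk, final=False)
                except UnicodeDecodeError:
                    alive.discard(name)             # 'turn off' this path
        return [name for name in SCHEMES if name in alive]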

Mark
__
http://www.macchiato.com

- Original Message - 
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Brijesh Sharma" <[EMAIL PROTECTED]>
Sent: Sun, 2004 Jan 11 21:48
Subject: Re: Detecting encoding in Plain text


> Brijesh Sharma  wrote:
>
> > I am writing a small tool to get text from a txt file into an edit box.
> > Now this txt file could be in any encoding for eg(UTF-8,UTF-16,Mac
> > Roman,Windows ANSI,Western (ISO-8859-1),JIS,Shift-JIS etc)
> > My problem is that I can distinguish between UTF-8 or UTF-16 using
> > the BOM.
> > But how do I auto detect the others.
> > Any kind of help will be appreciated.
>
> This has always been an interesting topic to me, even before the Unicode
> era.  The best information I have ever seen on this topic is Li and
> Momoi's paper.  To reiterate the URL:
>
> http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
>
> If you are "writing a small tool," however, you may not have the space
> or time to implement everything Li and Momoi described.
>
> You probably need to divide the problem into (1) detection of Unicode
> encodings and (2) detection of non-Unicode encodings, because these are
> really different problems.
>
> Detecting Unicode encodings, of course, is trivial if the stream begins
> with a signature (BOM, U+FEFF), but as Jon Hanna pointed out, you can't
> always count on the signature being present.  You need to rely primarily
> on what Li and Momoi call the "coding scheme method," searching for
> valid (and invalid) sequences in the various encoding schemes.  This
> works well for UTF-8 in particular; most non-contrived text that
> contains at least one valid multibyte UTF-8 sequence and no invalid
> UTF-8 sequences is very likely to be UTF-8.
>
> In UTF-16 practically any sequence of bytes is valid, and since you
> can't assume you know the language, you can't employ distribution
> statistics.  Twelve years ago, when most text was not Unicode and all
> Unicode text was UTF-16, Microsoft documentation suggested a heuristic
> of checking every other byte to see if it was zero, which of course
> would only work for Latin-1 text encoded in UTF-16.  If you need to
> detect the encoding of non-Western-European text, you would have to be
> more sophisticated than this.
>
> Here are some notes I've taken on detecting a byte stream known to be in
> a Unicode encoding scheme (UTF-8, UTF-16, UTF-32, or SCSU).  This is a
> work in progress and is not expected to be complete or perfect, so feel
> free to send corrections and enhancements but not flames:
>
> 0A 00
> - inverse of U+000A LINE FEED
> - U+0A00 = unassigned Gurmukhi code point
> - may indicate little-endian UTF-16
>
> 0A 0D
> - 8-bit line-feed + carriage return
> - U+0A0D = unassigned Gurmukhi code point
> - probably indicates 8-bit encoding
>
> 0D 00
> - inverse of U+000D CARRIAGE RETURN
> - U+0D00 = unassigned Malayalam code point
> - may indicate little-endian UTF-16
>
> 0D 0A
> - 8-bit carriage return + line feed
> - U+0D0A = MALAYALAM LETTER UU
>   - text should include other Malayalam characters (U+0D00-U+0D7F)
> - otherwise, probably indicates 8-bit encoding
>
> 20 00
> - inverse of U+0020 SPACE
> - U+2000 = EN QUAD (infrequent character)
> - may indicate UTF-16 (probably little-endian)
>
> 28 20
> - inverse of U+2028 LINE SEPARATOR
> - U+2820 = BRAILLE PATTERN DOTS-6
>   - text should include other Braille characters (U+2800-U+28FF)
> - may indicate little-endian UTF-16
> - but may also indicate 8-bit 20 28 or 28 20 (space + left parenthesis)
>
> E2 80 A8
> - UTF-8 representation of U+2028 LINE SEPARATOR
> - probably indicates UTF-8

Re: Detecting encoding in Plain text

2004-01-12 Thread Philippe Verdy
From: "Peter Kirk" <[EMAIL PROTECTED]>
> On 12/01/2004 03:09, Marco Cimarosti wrote:
>
> > ...
> >
> >It is extremely unlikely that a text file encoded in any single- or
> >multi-byte encoding (including UTF-8) would contain a zero byte, so the
> >presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
> >UTF-32.
> >
> >
> >
> Is it not dangerous to assume that U+0000 is not used? This is a valid
> character and is commonly used e.g. as a string terminator. Perhaps it
> should not be used in truly plain text. But it is likely to occur in
> files which are basically text but include certain kinds of markup.

This character is invalid at least in HTML, XML, XHTML, SGML and text/plain
files. Its presence in a file just indicates that this is not a plain text
file, so it could have arbitrary supplementary content which does not use
any relevant text encoding.

More precisely, I think it's safer to consider that any file that seems to
contain NUL characters is not a text file or, if it really is one, that it
uses a non-8-bit Unicode encoding scheme like UTF-16 or UTF-32 or a legacy
16-bit charset.

Any attempt to match a file containing a NUL byte as a plain-text file with
an 8-bit charset should fail (at least if the autodetection is needed to
parse an HTML or XML text file in a browser). Note that this check extends
to the byte 0x01, which also unambiguously indicates that the file, if it
really is plain text, cannot use a legacy 8-bit charset but could be matched
with UTF-16, UTF-32, SCSU or a legacy 16-bit charset.
(However, I can't remember whether this applies to VISCII: does it encode a
plain-text Unicode character at position 0x01, instead of a C0 control?)

My opinion is that most C0 and C1 controls are used as part of an
out-of-band
protocol, and they are not valid and should not be present in plain text
files
once they have been decoded and converted to Unicode, where only a few
should remain: TAB, LF, FF, CR, NEL. Some controls are needed in encoded
plain-text files only for some encoding schemes, but they do not encode
actual characters after the encoding scheme has been parsed: BS, SO, SI,
ESC, DLE, SS2, SS3... If there's no specific precise support for these
legacy encoding schemes, there should not be any attempt to "detect" them by
assuming they could be present in a plain-text file.




Re: Detecting encoding in Plain text

2004-01-12 Thread Peter Kirk
On 12/01/2004 03:09, Marco Cimarosti wrote:

...

It is extremely unlikely that a text file encoded in any single- or
multi-byte encoding (including UTF-8) would contain a zero byte, so the
presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
UTF-32.
 

Is it not dangerous to assume that U+0000 is not used? This is a valid 
character and is commonly used e.g. as a string terminator. Perhaps it 
should not be used in truly plain text. But it is likely to occur in 
files which are basically text but include certain kinds of markup.

... This is due to the fact that, in any language, shared characters in
the Latin-1 range (controls, space, digits, punctuation, etc.) should be
more frequent than occasional code points of the form U+xx00. ...
This one also looks dangerous. Some scripts include their own digits and 
punctuation; not all scripts use spaces; and controls are not 
necessarily used, if U+2028 LINE SEPARATOR is used for new lines. But 
there may be some characters U+??00 which are used rather commonly in a 
particular script and so occur commonly in some text files.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




RE: Detecting encoding in Plain text

2004-01-12 Thread jon
Quoting Marco Cimarosti <[EMAIL PROTECTED]>:

> Doug Ewell wrote:
> > In UTF-16 practically any sequence of bytes is valid, and since you
> > can't assume you know the language, you can't employ distribution
> > statistics.  Twelve years ago, when most text was not Unicode and all
> > Unicode text was UTF-16, Microsoft documentation suggested a heuristic
> > of checking every other byte to see if it was zero, which of course
> > would only work for Latin-1 text encoded in UTF-16.
> 
> I beg to differ. IMHO, analyzing zero bytes is a viable method for detecting
> BOM-less UTF-16 and UTF-32. BTW, I didn't know (and I don't quite care) that
> this method was suggested first by Microsoft: to me, it seems quite
> self-evident.
> 
> It is extremely unlikely that a text file encoded in any single- or
> multi-byte encoding (including UTF-8) would contain a zero byte, so the
> presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
> UTF-32.

False positives can be caused by the use of U+0000 (which is most often encoded 
as 0x00) which some applications do use in text files. Hence you need to look 
for sequences where there is a null octet every other octet, which increases 
the risk of false negatives:

False negatives can be caused by text that doesn't contain any Latin-1 
characters.

The method can be used reliably with text files that are guaranteed to contain 
large amounts of Latin-1 - in particular files for which certain ASCII 
characters are given an application-specific meaning; for instance XML and HTML 
files, comma-delimited files, tab-delimited files, vCards and so on. It can be 
particularly reliable in cases where certain ASCII characters will always begin 
the document (e.g. XML).

--
Jon Hanna

*Thought provoking quote goes here*



RE: Detecting encoding in Plain text

2004-01-12 Thread Marco Cimarosti
Doug Ewell wrote:
> In UTF-16 practically any sequence of bytes is valid, and since you
> can't assume you know the language, you can't employ distribution
> statistics.  Twelve years ago, when most text was not Unicode and all
> Unicode text was UTF-16, Microsoft documentation suggested a heuristic
> of checking every other byte to see if it was zero, which of course
> would only work for Latin-1 text encoded in UTF-16.

I beg to differ. IMHO, analyzing zero bytes is a viable method for detecting
BOM-less UTF-16 and UTF-32. BTW, I didn't know (and I don't quite care) that
this method was suggested first by Microsoft: to me, it seems quite
self-evident.

It is extremely unlikely that a text file encoded in any single- or
multi-byte encoding (including UTF-8) would contain a zero byte, so the
presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
UTF-32.

The next step is distinguishing between UTF-16 and UTF-32. A bullet-proof
negative heuristic for UTF-32 is that a text file *cannot* be UTF-32 unless
at least 1/4 of its bytes are zero. A positive heuristic for UTF-32 is
detecting sequences of two consecutive zero bytes, the first of which has an
odd index (counting byte indexes from 1): as it is very unlikely that a
UTF-16 file would contain a NULL character, zero 16-bit words must be part
of a UTF-32 character. The combination of these two methods is enough to
tell UTF-16 and UTF-32 apart.

Once you have determined whether the file is in UTF-16 or in UTF-32, a
statistical analysis of the *indexes* of zero bytes should be enough to
determine the UTF's endianness. UTF-16 is likely to be little-endian if
zero bytes are more frequent at even indexes than at odd indexes, and vice
versa. This is due to the fact that, in any language, shared characters in
the Latin-1 range (controls, space, digits, punctuation, etc.) should be
more frequent than occasional code points of the form U+xx00. For UTF-32,
determining endianness is even simpler: if *all* bytes whose index is
divisible by 4 are zero, then it is little-endian, else it is big-endian.

Of course, all this works only if the basic assumption that the file is a
plain text file is true: this method is not quite enough for telling text
files apart from binary files.
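
A minimal Python sketch of this zero-byte heuristic (my reconstruction; it
uses 0-based byte offsets and assumes the data really is plain text in some
Unicode encoding form):

    def guess_utf_by_zero_bytes(data: bytes) -> str:
        n = len(data)
        if n == 0 or 0 not in data:
            return "8-bit or UTF-8"      # zero bytes are the UTF-16/32 hint

        # An aligned 00 00 word would be a NUL character in UTF-16, which is
        # very unlikely in plain text, so treat it as evidence of UTF-32 --
        # provided at least 1/4 of all bytes are zero (necessary for UTF-32).
        def zero_word_at(offset):
            return any(data[i] == 0 and data[i + 1] == 0
                       for i in range(offset, n - 1, 4))

        if 4 * data.count(0) >= n:
            if zero_word_at(0) and not zero_word_at(2):
                return "UTF-32BE"        # 00 00 xx yy per code unit
            if zero_word_at(2) and not zero_word_at(0):
                return "UTF-32LE"        # xx yy 00 00 per code unit

        # Otherwise assume UTF-16: Latin-1-range characters put their zero
        # byte at even 0-based offsets in BE order, at odd offsets in LE.
        zeros_even = sum(1 for i in range(0, n, 2) if data[i] == 0)
        zeros_odd = sum(1 for i in range(1, n, 2) if data[i] == 0)
        return "UTF-16BE" if zeros_even >= zeros_odd else "UTF-16LE"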

_ Marco





Re: Detecting encoding in Plain text

2004-01-12 Thread Philippe Verdy
From: "Doug Ewell" <[EMAIL PROTECTED]>
> In UTF-16 practically any sequence of bytes is valid, and since you
> can't assume you know the language, you can't employ distribution
> statistics.  Twelve years ago, when most text was not Unicode and all
> Unicode text was UTF-16, Microsoft documentation suggested a heuristic
> of checking every other byte to see if it was zero, which of course
> would only work for Latin-1 text encoded in UTF-16.  If you need to
> detect the encoding of non-Western-European text, you would have to be
> more sophisticated than this.

Here I completely disagree: even though almost any 16-bit value in UTF-16
is valid, the values are NOT uniformly distributed. You'll see immediately
that even and odd bytes have very distinct distributions: the bytes
representing the least significant bits of code units have a flatter
distribution over a wider range, while the other bytes are concentrated in
very few byte values (rarely more than 2 or 3 for European languages, or
with a mostly flat distribution over some limited ranges for Korean or for
Chinese).

Even today, when Unicode has more than one plane, UTF-16 is still easy to
recognize, because you'll see sequences of bytes where a byte between 0xD8
and 0xDB is followed two bytes later by a byte between 0xDC and 0xDF. The
low bit of the positions of these two bytes reveals whether the text is
coded in UTF-16BE or UTF-16LE, and then you can look at the effective
ranges of the decoded UTF-16 code units to detect unassigned or illegal
code points which would invalidate the UTF-16 possibility.
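
A minimal Python sketch of that surrogate-pair test (0-based offsets,
assuming the stream starts on a code-unit boundary; a fuller check would go
on to validate the decoded code units as described above):

    def guess_utf16_endianness_from_surrogates(data: bytes):
        # A high-surrogate lead byte (D8..DB) followed two bytes later by a
        # low-surrogate lead byte (DC..DF) marks a surrogate pair; its parity
        # in the byte stream reveals the byte order.
        for i in range(len(data) - 3):
            if 0xD8 <= data[i] <= 0xDB and 0xDC <= data[i + 2] <= 0xDF:
                return "UTF-16BE" if i % 2 == 0 else "UTF-16LE"
        return None   # no surrogate pair found; fall back to other heuristics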




Re: Detecting encoding in Plain text

2004-01-11 Thread Doug Ewell
Brijesh Sharma  wrote:

> I am writing a small tool to get text from a txt file into an edit box.
> Now this txt file could be in any encoding for eg(UTF-8,UTF-16,Mac
> Roman,Windows ANSI,Western (ISO-8859-1),JIS,Shift-JIS etc)
> My problem is that I can distinguish between UTF-8 or UTF-16 using
> the BOM.
> But how do I auto detect the others.
> Any kind of help will be appreciated.

This has always been an interesting topic to me, even before the Unicode
era.  The best information I have ever seen on this topic is Li and
Momoi's paper.  To reiterate the URL:

http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

If you are "writing a small tool," however, you may not have the space
or time to implement everything Li and Momoi described.

You probably need to divide the problem into (1) detection of Unicode
encodings and (2) detection of non-Unicode encodings, because these are
really different problems.

Detecting Unicode encodings, of course, is trivial if the stream begins
with a signature (BOM, U+FEFF), but as Jon Hanna pointed out, you can't
always count on the signature being present.  You need to rely primarily
on what Li and Momoi call the "coding scheme method," searching for
valid (and invalid) sequences in the various encoding schemes.  This
works well for UTF-8 in particular; most non-contrived text that
contains at least one valid multibyte UTF-8 sequence and no invalid
UTF-8 sequences is very likely to be UTF-8.
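
A minimal illustration of that coding scheme test for UTF-8 (a sketch, not
the detector described in the paper):

    def looks_like_utf8(data: bytes) -> bool:
        try:
            data.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return False                      # invalid sequence: not UTF-8
        # Require at least one non-ASCII byte, i.e. one multibyte sequence;
        # otherwise the data is indistinguishable from plain ASCII.
        return any(b >= 0x80 for b in data)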

In UTF-16 practically any sequence of bytes is valid, and since you
can't assume you know the language, you can't employ distribution
statistics.  Twelve years ago, when most text was not Unicode and all
Unicode text was UTF-16, Microsoft documentation suggested a heuristic
of checking every other byte to see if it was zero, which of course
would only work for Latin-1 text encoded in UTF-16.  If you need to
detect the encoding of non-Western-European text, you would have to be
more sophisticated than this.

Here are some notes I've taken on detecting a byte stream known to be in
a Unicode encoding scheme (UTF-8, UTF-16, UTF-32, or SCSU).  This is a
work in progress and is not expected to be complete or perfect, so feel
free to send corrections and enhancements but not flames:

0A 00
- inverse of U+000A LINE FEED
- U+0A00 = unassigned Gurmukhi code point
- may indicate little-endian UTF-16

0A 0D
- 8-bit line-feed + carriage return
- U+0A0D = unassigned Gurmukhi code point
- probably indicates 8-bit encoding

0D 00
- inverse of U+000D CARRIAGE RETURN
- U+0D00 = unassigned Malayalam code point
- may indicate little-endian UTF-16

0D 0A
- 8-bit carriage return + line feed
- U+0D0A = MALAYALAM LETTER UU
  - text should include other Malayalam characters (U+0D00-U+0D7F)
- otherwise, probably indicates 8-bit encoding

20 00
- inverse of U+0020 SPACE
- U+2000 = EN QUAD (infrequent character)
- may indicate UTF-16 (probably little-endian)

28 20
- inverse of U+2028 LINE SEPARATOR
- U+2820 = BRAILLE PATTERN DOTS-6
  - text should include other Braille characters (U+2800-U+28FF)
- may indicate little-endian UTF-16
- but may also indicate 8-bit 20 28 or 28 20 (space + left parenthesis)

E2 80 A8
- UTF-8 representation of U+2028 LINE SEPARATOR
- probably indicates UTF-8

05 28
- SCSU representation of U+2028 LINE SEPARATOR
- U+0528 is unassigned
- U+2805 is BRAILLE PATTERN DOTS-13
  - should be surrounded by other Braille characters
- otherwise, probably indicates SCSU

00 00 00
- probably a Basic Latin character in UTF-32 (either byte order)
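
For illustration, a small table-driven scan over these byte patterns might
look like the following Python sketch (the hint strings merely paraphrase
the notes above; how to weigh conflicting hints is left open):

    HINTS = {
        b"\x0a\x00": "may indicate little-endian UTF-16 (inverse of LINE FEED)",
        b"\x0d\x00": "may indicate little-endian UTF-16 (inverse of CR)",
        b"\x20\x00": "may indicate little-endian UTF-16 (inverse of SPACE)",
        b"\x0a\x0d": "probably an 8-bit encoding (LF + CR)",
        b"\x0d\x0a": "probably an 8-bit encoding (CR + LF), unless Malayalam",
        b"\x28\x20": "may indicate little-endian UTF-16 (inverse of U+2028)",
        b"\xe2\x80\xa8": "probably UTF-8 (encoded LINE SEPARATOR)",
        b"\x05\x28": "may indicate SCSU (encoded LINE SEPARATOR)",
    }

    def pattern_hints(data: bytes) -> dict:
        # Count how often each signature pattern occurs in the stream.
        return {pattern: data.count(pattern)
                for pattern in HINTS if pattern in data}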

Detecting non-Unicode encodings is quite another matter, and here you
really need to study the techniques described by Li and Momoi.
Distinguishing straight ASCII from ISO 8859-1 from Windows-1252 is
easy -- just check which subsets of Windows-1252 are present -- but
throwing Mac Roman and East Asian double-byte sets into the mix is
another matter.

I once wrote a program to detect the encoding of a text sample known to
be in one of the following Cyrillic encodings:

- KOI8-R
- Windows code page 1251
- ISO 8859-5
- MS-DOS code page 866
- MS-DOS code page 855
- Mac Cyrillic

Given the Unicode scalar values corresponding to each byte value, the
program calculates the proportion of Cyrillic characters (as opposed to
punctuation and dingbats) when interpreted in each possible encoding,
and picks the encoding with the highest proportion (confidence level).
This is a dumbed-down version of Li and Momoi's character distribution
method, but works surprisingly well so long as the text really is in one
of these Cyrillic encodings.  It fails spectacularly for text in
Latin-1, Mac Roman, UTF-8, etc.  It would probably also be unable to
detect differences between almost-identical character sets, like KOI8-R
and KOI8-U.
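
A rough Python sketch of that confidence calculation (a reconstruction of
the idea, not the original program; the codec names are Python's):

    CANDIDATES = ["koi8-r", "cp1251", "iso8859-5", "cp866", "cp855",
                  "mac-cyrillic"]

    def cyrillic_confidence(data: bytes, encoding: str) -> float:
        # Proportion of letters that are Cyrillic (U+0400..U+04FF) when the
        # bytes are interpreted in the given encoding.
        text = data.decode(encoding, errors="replace")
        letters = [c for c in text if c.isalpha()]
        if not letters:
            return 0.0
        return sum("\u0400" <= c <= "\u04ff" for c in letters) / len(letters)

    def guess_cyrillic_encoding(data: bytes) -> str:
        return max(CANDIDATES, key=lambda enc: cyrillic_confidence(data, enc))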

The smaller your list of "possible" encodings, the easier your job of
detecting one of them.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Detecting encoding in Plain text

2004-01-09 Thread Katsuhiko Momoi
Peter Jacobi wrote:

Katsuhiko Momoi wrote:
 

The specific URL for our IUC 19 paper with an update note at the 
beginning is this:

http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
   

from said paper:

[UTF8] is inactive
[SJIS] is inactive
[EUCJP] detector has confidence 0.95
[GB2312] detector has confidence 0.150852
[EUCKR] is inactive
[Big5] detector has confidence 0.129412
[EUCTW] is inactive
[Windows-1251 ] detector has confidence 0.01
[KOI8-R] detector has confidence 0.01
[ISO-8859-5] detector has confidence 0.01
[x-mac-cyrillic] detector has confidence 0.01
[IBM866] detector has confidence 0.01
[IBM855] detector has confidence 0.01 


Is there any hidden preference in Mozilla to make these statistics
visible?
The first step is to use a debug build -- you need to build it from the 
source. This will allow debug output to a console. But as I recall, 
Shanjian disabled the output from Chardet at some point. Let me CC him 
so that he can tell you the rest.

- Kat

--
Katsuhiko Momoi
e-mail: [EMAIL PROTECTED]





Re: Detecting encoding in Plain text

2004-01-09 Thread Peter Jacobi
Katsuhiko Momoi wrote:
> The specific URL for our IUC 19 paper with an update note at the 
> beginning is this:
> 
> http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

from said paper:

[UTF8] is inactive
[SJIS] is inactive
[EUCJP] detector has confidence 0.95
[GB2312] detector has confidence 0.150852
[EUCKR] is inactive
[Big5] detector has confidence 0.129412
[EUCTW] is inactive
[Windows-1251 ] detector has confidence 0.01
[KOI8-R] detector has confidence 0.01
[ISO-8859-5] detector has confidence 0.01
[x-mac-cyrillic] detector has confidence 0.01
[IBM866] detector has confidence 0.01
[IBM855] detector has confidence 0.01 


Is there any hidden preference in Mozilla to make these statistics
visible?

Regards,
Peter Jacobi






Re: Detecting encoding in Plain text

2004-01-08 Thread Katsuhiko Momoi
Jungshik Shin wrote:

On Thu, 8 Jan 2004, Tex Texin wrote:

 

There were also papers on the subject at past unicode conferences.
Look for one by Martin Duerst several years ago and one by Kat Momoi,
Netscape only a few years back. I think both are on the web.
   

 

Also look at the Netscape open source code. I believe it does some
detection.
   

 It's mozilla (not netscape) :-). See

http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/
http://lxr.mozilla.org/seamonkey/source/intl/chardet/
Li and Momoi paper presented at the 19th IUC is available there.
 

The specific URL for our IUC 19 paper with an update note at the 
beginning is this:

http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

- Kat

--
Katsuhiko Momoi
e-mail: [EMAIL PROTECTED]




Re: Detecting encoding in Plain text

2004-01-08 Thread Jungshik Shin
On Thu, 8 Jan 2004, Tex Texin wrote:

> There were also papers on the subject at past unicode conferences.
> Look for one by Martin Duerst several years ago and one by Kat Momoi,
> Netscape only a few years back. I think both are on the web.

> Also look at the Netscape open source code. I believe it does some
> detection.

  It's mozilla (not netscape) :-). See

http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/
http://lxr.mozilla.org/seamonkey/source/intl/chardet/

Li and Momoi paper presented at the 19th IUC is available there.

  Jungshik



RE: Detecting encoding in Plain text

2004-01-08 Thread Chris Pratley
If you are on the Windows platform, look at mlang.dll, and at the
IMultiLanguage2 and IMultiLanguage3 APIs, which provide this service. As
others have noted you will get false detections with too little or
ambiguous data, but you may be quite surprised at just how accurate this
detection is (sometimes just one character outside of the "ASCII"
repertoire), since there is language frequency data used as well as
merely encoding rules.

Chris

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Brijesh Sharma
Sent: January 8, 2004 3:08 AM
To: Unicode Mailing List
Subject: Detecting encoding in Plain text

Hi All,
I am new to Unicode.
I am writing a small tool to get text from a txt file into an edit box.
Now this txt file could be in any encoding for eg(UTF-8,UTF-16,Mac
Roman,Windows ANSI,Western (ISO-8859-1),JIS,Shift-JIS etc)
My problem is that I can distinguish between UTF-8 or UTF-16 using the
BOM.
But how do I auto detect the others.
Any kind of help will be appreciated.
 

Regards
Brijesh Sharma 


"You're not obligated to win. You're obligated to keep trying to do the
best
you can every day."





Re: Detecting encoding in Plain text

2004-01-08 Thread Tex Texin
There were also papers on the subject at past unicode conferences.
Look for one by Martin Duerst several years ago and one by Kat Momoi, Netscape
only a few years back.
I think both are on the web.

Also look at the Netscape open source code. I believe it does some detection.

However, accuracy can be greatly improved if you or the end-user can supply
some information about the likely nature of the data (language, platform, most
likely encoding possibilities, file formats, data format or content information
e.g. field of expertise, etc.)

tex



"D. Starner" wrote:
> 
> > Given any sizeable chunk of text, it ought to be possible to estimate
> > the statistical likelihood of its being in a certain
> > encoding/[language] even if it's in an unspecified 8859-* encoding.
> > It would be quite an interesting exercise, but I'd be surprised if
> > someone hasn't done it before.  Perhaps someone here knows.
> 
> http://www.let.rug.nl/~vannoord/TextCat/ has a paper on the subject
> and an implementation in Perl. http://mnogosearch.org has an alternate
> implementation in compiled code (called mguesser).

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCrafthttp://www.XenCraft.com
Making e-Business Work Around the World
-




Re: Detecting encoding in Plain text

2004-01-08 Thread Patrick Andries

- Message d'origine - 
De: "John Delacour" <[EMAIL PROTECTED]>


> Given any sizeable chunk of text, it ought to be possible to estimate 
> the statistical likelihood of its being in a certain 
> encoding/[language] even if it's in an unspecified 8859-* encoding. 
> It would be quite an interesting exercise, but I'd be surprised if 
> someone hasn't done it before.  Perhaps someone here knows.

See 

http://www.alis.com/fr/services_que.html
http://www.alis.com/en/services_que.html

P. A.





Re: Detecting encoding in Plain text

2004-01-08 Thread D. Starner
> Given any sizeable chunk of text, it ought to be possible to estimate 
> the statistical likelihood of its being in a certain 
> encoding/[language] even if it's in an unspecified 8859-* encoding. 
> It would be quite an interesting exercise, but I'd be surprised if 
> someone hasn't done it before.  Perhaps someone here knows.

http://www.let.rug.nl/~vannoord/TextCat/ has a paper on the subject
and an implementation in Perl. http://mnogosearch.org has an alternate
implementation in compiled code (called mguesser). 




Re: Detecting encoding in Plain text

2004-01-08 Thread jon
> I am writing a small tool to get text from a txt file into an edit box.
> Now this txt file could be in any encoding for eg(UTF-8,UTF-16,Mac
> Roman,Windows ANSI,Western (ISO-8859-1),JIS,Shift-JIS etc)
> My problem is that I can distinguish between UTF-8 or UTF-16 using the BOM.
> But how do I auto detect the others.
> Any kind of help will be appreciated.

There is no foolproof way of differentiating between some of the encodings. 
While UTF-16 or UTF-8 with a BOM (such files don't necessarily start with a BOM 
by the way) "stand out" as being unlikely to be in any other encoding,
others are more troublesome.

If there is no source of encoding information (such as you get with xml 
declarations, HTTP headers and such), and even if there is, it may be best to 
offer your users the ability to select encodings (perhaps with the default 
choice based on locale settings).

--
Jon Hanna

*Thought provoking quote goes here*



Re: Detecting encoding in Plain text

2004-01-08 Thread John Delacour
At 12:09 pm + 8/1/04, [EMAIL PROTECTED] wrote:

There is no foolproof way of differentiating between some of the 
encodings.  While UTF-16 or UTF-8 with a BOM (such files don't 
necessarily start with a BOM by the way) "stand out" as being 
unlikely to be in any other encoding, others are more troublesome.
Given any sizeable chunk of text, it ought to be possible to estimate 
the statistical likelihood of its being in a certain 
encoding/[language] even if it's in an unspecified 8859-* encoding. 
It would be quite an interesting exercise, but I'd be surprised if 
someone hasn't done it before.  Perhaps someone here knows.

JD