Hi all,
Great to hear so many views on detecting encoding.
I would also like to share something related to detecting UTF-8 encoding.
As most of you probably know, we can check any stream of bytes for UTF-8
encoding by seeing whether any of the following byte sequences appears.
If not, we simply
Peter Kirk writes:
I agree that heuristics should be adjusted for Thai. But problems may
arise if they have to be adjusted individually, and without regression
errors, for all 6000+ world languages.
Thai is hard because of the writing system. But most writing systems weren't
encoded pre-Unicode,
- Original Message -
From: Peter Kirk [EMAIL PROTECTED]
Date: Tue, 13 Jan 2004 09:03:48 -0800
To: Doug Ewell [EMAIL PROTECTED]
Subject: Re: Detecting encoding in Plain text
On 13/01/2004 08:34, Doug Ewell wrote:
Peter Kirk peterkirk at qaya dot org wrote:
If a certain Unicode plain
Deepak Chand Rathore deepakr at aztec dot soft dot net wrote:
But there is one concern. In some cases the UTF-8 byte stream starts
with a BOM (e.g. when we try reading bytes from a text file that
is saved using Notepad (with the UTF-8 option) in Win2k, after the first few
bytes (I suppose the first 3
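For illustration, the leading-BOM case described above could be handled along these lines. This is only a minimal sketch in Python; the helper name `strip_utf8_bom` is made up for the example, and it assumes the whole file has already been read into memory as bytes.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# A byte stream as a BOM-writing editor might save it (assumed sample data).
raw = codecs.BOM_UTF8 + "héllo".encode("utf-8")
text = strip_utf8_bom(raw).decode("utf-8")

# Python's "utf-8-sig" codec does the same thing in one step.
same_text = raw.decode("utf-8-sig")
```

A stream without a BOM passes through unchanged, so the helper can be applied unconditionally before decoding.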
On 13/01/2004 18:05, D. Starner wrote:
Peter Kirk writes:
I agree that heuristics should be adjusted for Thai. But problems may
arise if they have to be adjusted individually, and without regression
errors, for all 6000+ world languages.
Thai is hard because of the writing system. But
Mark E. Shoulson wrote:
If it's a heuristic we're after, then why split hairs and try to make
all the rules ourselves? Get a big ol' mess of training data in as
many languages as you can and hand it over to a class full of CS
graduate students studying Machine Learning.
Absolutely my
On 14/01/2004 07:16, John Burger wrote:
...
By the way, I still don't quite understand what's special about Thai.
Could someone elaborate?
I mentioned Thai because it is the only language I know of which does
not use SPACE, U+0020. It also has at least some of its own
punctuation. So a Thai
John Burger john at mitre dot org wrote:
By the way, I still don't quite understand what's special about Thai.
Could someone elaborate?
It was just a hypothetical example: Suppose there's some relatively
obscure script, oh, I don't know, say Thai, that breaks these
assumptions... There isn't
Deepak Chand Rathore wrote:
Unicode range              UTF-8 encoded bytes
U-00000000 - U-0000007F:   0xxxxxxx
U-00000080 - U-000007FF:   110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:   1110xxxx 10xxxxxx 10xxxxxx
...
This table is not
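As a rough illustration of checking a byte stream against lead-byte and continuation-byte patterns like those in the quoted table, here is a minimal Python sketch. The function name `has_utf8_structure` is hypothetical, and the four-byte row (up to U+10FFFF) is my addition rather than something shown in the quoted table. It is a structural check only: it does not reject every overlong form or encoded surrogate, which a strict decoder such as `bytes.decode("utf-8")` would.

```python
def has_utf8_structure(data: bytes) -> bool:
    """Check that every byte fits a UTF-8 lead/continuation pattern."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                # 0xxxxxxx: single-byte (ASCII)
            need = 0
        elif 0xC2 <= b <= 0xDF:     # 110xxxxx: one continuation byte
            need = 1                # (0xC0 and 0xC1 would be overlong)
        elif 0xE0 <= b <= 0xEF:     # 1110xxxx: two continuation bytes
            need = 2
        elif 0xF0 <= b <= 0xF4:     # 11110xxx: three continuation bytes
            need = 3
        else:
            return False            # stray continuation or invalid lead byte
        if i + need > n - 1:
            return False            # sequence is cut off at end of data
        for k in range(1, need + 1):
            if data[i + k] & 0xC0 != 0x80:   # continuation must be 10xxxxxx
                return False
        i += need + 1
    return True

utf8_sample = "héllo €".encode("utf-8")
latin1_sample = "été".encode("iso-8859-1")
```

Note that pure ASCII passes this check too, so a positive result only says the bytes are compatible with UTF-8, not that UTF-8 was intended.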
: Detecting encoding in Plain text
On 14/01/2004 07:16, John Burger wrote:
...
By the way, I still don't quite understand what's special about Thai.
Could someone elaborate?
I mentioned Thai because it is the only language I know of which does
not use SPACE, U+0020. It also has at least
On 14/01/2004 09:25, Mark Davis wrote:
I'm not sure which one suggested heuristic method you are referring to, ...
Basically the one that in UTF-16 there are likely to be many zero bytes
in either odd or even positions.
... but
you are bounding to conclusions. For example, one of the
To: John Burger [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Wed, 2004 Jan 14 08:12
Subject: Re: Detecting encoding in Plain text
On 14/01/2004 07:16, John Burger wrote:
...
By the way, I still don't quite understand what's special about Thai.
Could someone elaborate
Does Thai use CR and LF?
Peter Kirk wrote on 1/14/2004, 8:12 AM:
On 14/01/2004 07:16, John Burger wrote:
...
By the way, I still don't quite understand what's special about Thai.
Could someone elaborate?
I mentioned Thai because it is the only language I know of which does
John Burger wrote on 1/14/2004, 7:16 AM:
Mark E. Shoulson wrote:
If it's a heuristic we're after, then why split hairs and try to make
all the rules ourselves? Get a big ol' mess of training data in as
many languages as you can and hand it over to a class full of CS
graduate
On 14/01/2004 15:35, Frank Yung-Fong Tang wrote:
Does Thai use CR and LF?
I hadn't forgotten this, as you will find if you look back over the
whole thread. I would assume that some plain text might actually use the
Unicode recommended line and paragraph separator characters, rather than
CR
Title: RE: Detecting encoding in Plain text
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
Behalf Of Frank Yung-Fong Tang
Does Thai use CR and LF?
If it's in HTML, then, like every other language, it need not.
/|/|ike
Peter Kirk wrote:
This one also looks dangerous.
What do you mean by dangerous? This is an heuristic algorithm, so it is
only supposed to work always but only in some lucky cases.
If lucky cases average to, say, 20% or less then it is a bad and useless
algorithm; if they average to, say, 80% or
Jon Hanna wrote:
False positives can be caused by the use of U+0000 (which is
most often encoded as 0x00), which some applications do use
in text files.
I have never seen such a thing; can you give an example?
I can't imagine any use for a NULL in a file apart from terminating records or
strings
On 13/01/2004 02:40, Marco Cimarosti wrote:
Peter Kirk wrote:
This one also looks dangerous.
What do you mean by dangerous? This is an heuristic algorithm, so it is
only supposed to work always but only in some lucky cases.
If lucky cases average to, say, 20% or less then it is a bad and
Peter Kirk wrote:
What do you mean by dangerous? This is an heuristic
algorithm, so it is only supposed to work always [...]
(I meant: it is not supposed to work always)
I would not consider an 80% algorithm to be very good -
depending on the circumstances etc. But if for example 20% of
my
On 13/01/2004 04:10, Marco Cimarosti wrote:
...
In this case (as in most other similar cases), you should rather blame the
people who send you e-mail without encoding declaration.
I get plenty of them. But then I assume that they default to ASCII or
Windows-1252. Is there in fact a formal
Peter Kirk peterkirk at qaya dot org wrote:
If a certain Unicode plain text file uses ASCII punctuation OR spaces
OR end-of-line characters, AND the file is not too short or has a
very odd formatting, then the algorithm should work.
True. But there may be certain languages (perhaps Thai?)
On 13/01/2004 08:34, Doug Ewell wrote:
Peter Kirk peterkirk at qaya dot org wrote:
If a certain Unicode plain text file uses ASCII punctuation OR spaces
OR end-of-line characters, AND the file is not too short or has a
very odd formatting, then the algorithm should work.
True. But
On 01/13/04 05:40, Marco Cimarosti wrote:
Peter Kirk wrote:
This one also looks dangerous.
What do you mean by dangerous? This is an heuristic algorithm, so it is
only supposed to work always but only in some lucky cases.
If lucky cases average to, say, 20% or less then it is a bad and
From: Doug Ewell [EMAIL PROTECTED]
In UTF-16 practically any sequence of bytes is valid, and since you
can't assume you know the language, you can't employ distribution
statistics. Twelve years ago, when most text was not Unicode and all
Unicode text was UTF-16, Microsoft documentation
Doug Ewell wrote:
In UTF-16 practically any sequence of bytes is valid, and since you
can't assume you know the language, you can't employ distribution
statistics. Twelve years ago, when most text was not Unicode and all
Unicode text was UTF-16, Microsoft documentation suggested a heuristic
Quoting Marco Cimarosti [EMAIL PROTECTED]:
Doug Ewell wrote:
In UTF-16 practically any sequence of bytes is valid, and since you
can't assume you know the language, you can't employ distribution
statistics. Twelve years ago, when most text was not Unicode and all
Unicode text was
On 12/01/2004 03:09, Marco Cimarosti wrote:
...
It is extremely unlikely that a text file encoded in any single- or
multi-byte encoding (including UTF-8) would contain a zero byte, so the
presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
UTF-32.
Is it not dangerous to
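To make the zero-byte heuristic under discussion concrete, here is a small Python sketch of one possible reading of it: count zero bytes at even and odd offsets and guess a UTF-16 byte order from whichever side dominates. The function name and the 0.3 threshold are arbitrary assumptions for illustration, not values from the thread.

```python
def guess_utf16_by_zero_bytes(data: bytes, threshold: float = 0.3):
    """Return 'utf-16-le', 'utf-16-be', or None, based on where zero bytes fall."""
    pairs = len(data) // 2
    if pairs == 0:
        return None
    # Fraction of 16-bit units whose first (even-offset) byte is zero,
    # and whose second (odd-offset) byte is zero.
    even_ratio = data[0:pairs * 2:2].count(0) / pairs
    odd_ratio = data[1:pairs * 2:2].count(0) / pairs
    # ASCII-range characters in little-endian UTF-16 have the zero byte second.
    if odd_ratio >= threshold and even_ratio < threshold:
        return "utf-16-le"
    # In big-endian UTF-16 the zero byte comes first.
    if even_ratio >= threshold and odd_ratio < threshold:
        return "utf-16-be"
    return None

le_guess = guess_utf16_by_zero_bytes("Hello, world".encode("utf-16-le"))
be_guess = guess_utf16_by_zero_bytes("Hello, world".encode("utf-16-be"))
ascii_guess = guess_utf16_by_zero_bytes(b"Hello, world")
# Thai text with no ASCII characters contains no zero bytes in UTF-16 at all.
thai_guess = guess_utf16_by_zero_bytes("สวัสดี".encode("utf-16-le"))
```

The last case shows the weakness raised in this thread: text that uses no ASCII-range characters gives the heuristic nothing to work with, so it returns no guess.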
From: Peter Kirk [EMAIL PROTECTED]
On 12/01/2004 03:09, Marco Cimarosti wrote:
...
It is extremely unlikely that a text file encoded in any single- or
multi-byte encoding (including UTF-8) would contain a zero byte, so the
presence of zero bytes is a strong enough hint for UTF-16 (or
a long way.
Mark
__
http://www.macchiato.com
- Original Message -
From: Doug Ewell [EMAIL PROTECTED]
To: Unicode Mailing List [EMAIL PROTECTED]
Cc: Brijesh Sharma [EMAIL PROTECTED]
Sent: Sun, 2004 Jan 11 21:48
Subject: Re: Detecting encoding in Plain
Marco Cimarosti marco dot cimarosti at essetre dot it wrote:
In UTF-16 practically any sequence of bytes is valid, and since you
can't assume you know the language, you can't employ distribution
statistics. Twelve years ago, when most text was not Unicode and all
Unicode text was UTF-16,
Perhaps a meta question is this: how often are you going to encounter
unBOMed UTF-32 or UTF-16 text? It's pretty rare --- certainly I've never
seen it during the development of our language/encoding identifier.
Sure, it's an interesting thought problem, but it doesn't happen.
And fortunately
on 2004-01-12 08:57 Tom Emerson wrote:
You also have to deal with oddities of language: I tried one open
source implementation of the Cavnar and Trenkle algorithm THAT CLAIMED
THAT SHOUTED ENGLISH WAS ACTUALLY CZECH.
SHOUTED AT CLOSE RANGE (~ 1 CM FROM THE EAR) AND WITH A CZECH ACCENT, IT
SOUNDS
Brijesh Sharma bssharma at quark dot co dot in wrote:
I am writing a small tool to get text from a txt file into an edit box.
Now this txt file could be in any encoding (e.g. UTF-8, UTF-16, Mac
Roman, Windows ANSI, Western (ISO-8859-1), JIS, Shift-JIS, etc.)
My problem is that I can distinguish between
Katsuhiko Momoi wrote:
The specific URL for our IUC 19 paper with an update note at the
beginning is this:
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
from said paper:
cite
[UTF8] is inactive
[SJIS] is inactive
[EUCJP] detector has confidence 0.95
[GB2312]
Peter Jacobi wrote:
Katsuhiko Momoi wrote:
The specific URL for our IUC 19 paper with an update note at the
beginning is this:
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
from said paper:
cite
[UTF8] is inactive
[SJIS] is inactive
[EUCJP] detector has confidence
Hi All,
I am new to Unicode.
I am writing a small tool to get text from a txt file into an edit box.
Now this txt file could be in any encoding (e.g. UTF-8, UTF-16, Mac
Roman, Windows ANSI, Western (ISO-8859-1), JIS, Shift-JIS, etc.)
My problem is that I can distinguish between UTF-8 and UTF-16 using the BOM.
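The BOM check mentioned here might look roughly like the following Python sketch. The names `_BOMS` and `sniff_bom` are hypothetical. UTF-32 is tested before UTF-16 because the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).

```python
import codecs

# Longer BOMs first, so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(sample)
text = sample[skip:].decode(encoding) if encoding else None
```

When no BOM is found the function says nothing about the encoding, which is exactly the case the rest of this thread is about.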
At 12:09 pm +0000 8/1/04, [EMAIL PROTECTED] wrote:
There is no foolproof way of differentiating between some of the
encodings. While UTF-16 or UTF-8 with a BOM (such files don't
necessarily start with a BOM by the way) stand out as being
unlikely to be in any other encoding others are more
I am writing a small tool to get text from a txt file into an edit box.
Now this txt file could be in any encoding (e.g. UTF-8, UTF-16, Mac
Roman, Windows ANSI, Western (ISO-8859-1), JIS, Shift-JIS, etc.)
My problem is that I can distinguish between UTF-8 and UTF-16 using the BOM.
But how do I auto
Given any sizeable chunk of text, it ought to be possible to estimate
the statistical likelihood of its being in a certain
encoding/[language] even if it's in an unspecified 8859-* encoding.
It would be quite an interesting exercise, but I'd be surprised if
someone hasn't done it before.
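A toy version of that exercise could be sketched as follows, purely as an illustration of the idea and not of any existing detector. It assumes you already have sample text in the expected language, decodes the unknown bytes under each candidate 8859-* encoding, and ranks the candidates by how familiar the resulting characters look. The function names and the French sample strings are made up for the example.

```python
from collections import Counter

def char_profile(sample: str) -> dict:
    """Relative frequency of each character in known-language sample text."""
    counts = Counter(sample)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}

def rank_encodings(data: bytes, candidates, profile) -> list:
    """Rank candidate encodings by mean profile frequency of decoded characters."""
    scores = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # bytes undefined in this encoding: drop the candidate
        score = sum(profile.get(ch, 0.0) for ch in text) / max(len(text), 1)
        scores.append((score, enc))
    return [enc for score, enc in sorted(scores, reverse=True)]

# Assumed training sample and unknown input, both French for the example.
profile = char_profile("le café est très près de la forêt où l'été a été chaud")
unknown = "l'été dernier, près du café".encode("iso-8859-1")
ranking = rank_encodings(unknown, ["iso-8859-1", "iso-8859-5", "iso-8859-7"], profile)
```

Single-character frequencies are about the crudest statistic possible, and a tiny sample like this would not separate closely related encodings such as ISO-8859-1 and ISO-8859-15. A serious attempt would need far more training text and probably n-gram statistics.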
- Original Message -
From: John Delacour [EMAIL PROTECTED]
Given any sizeable chunk of text, it ought to be possible to estimate
the statistical likelihood of its being in a certain
encoding/[language] even if it's in an unspecified 8859-* encoding.
It would be quite an
There were also papers on the subject at past unicode conferences.
Look for one by Martin Duerst several years ago and one by Kat Momoi, Netscape
only a few years back.
I think both are on the web.
Also look at the Netscape open source code. I believe it does some detection.
However, accuracy
List
Subject: Detecting encoding in Plain text
Hi All,
I am new to Unicode.
I am writing a small tool to get text from a txt file into an edit box.
Now this txt file could be in any encoding (e.g. UTF-8, UTF-16, Mac
Roman, Windows ANSI, Western (ISO-8859-1), JIS, Shift-JIS, etc.)
My problem is that I can
On Thu, 8 Jan 2004, Tex Texin wrote:
There were also papers on the subject at past unicode conferences.
Look for one by Martin Duerst several years ago and one by Kat Momoi,
Netscape only a few years back. I think both are on the web.
Also look at the Netscape open source code. I believe it
Jungshik Shin wrote:
On Thu, 8 Jan 2004, Tex Texin wrote:
There were also papers on the subject at past unicode conferences.
Look for one by Martin Duerst several years ago and one by Kat Momoi,
Netscape only a few years back. I think both are on the web.
Also look at the Netscape