Re: Mozilla Charset Detectors

2017-05-22 Thread Jonathan Kew

On 22/05/2017 10:13, Gabriel Sandor wrote:

> Greetings,
>
> I recently came across the Mozilla Charset Detectors tool, at
> https://www-archive.mozilla.org/projects/intl/chardet.html. I'm working on
> a C# project where I could use a port of this library (e.g.
> https://github.com/errepi/ude) for advanced charset detection.
>
> I'm not sure however if this tool is deprecated or not, and still
> recommended by Mozilla for use in modern applications. The tool page is
> archived and most of the links are dead, while the code seems to be at
> least 7-8 years old. Could you please tell me what's the status of this
> tool and whether I should use it in my project or look for something else?


I'd suggest looking at ICU, a modern, actively-maintained library that
can probably help you:

 http://userguide.icu-project.org/conversion/detection

Or there's also https://github.com/google/compact_enc_det (as mentioned 
in the ICU doc).
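
For illustration, here's a minimal sketch (in C, with error handling
trimmed) of ICU4C's detection API from unicode/ucsdet.h; it assumes ICU
is installed and simply copies out the name of the best guess:

  #include <string.h>
  #include <unicode/ucsdet.h>

  /* Copy ICU's best charset guess (e.g. "UTF-8", "windows-1252") into
   * 'name'. Returns 1 on success, 0 if nothing plausible was found. */
  int guess_charset(const char *data, int32_t len,
                    char *name, size_t name_size) {
      UErrorCode status = U_ZERO_ERROR;
      UCharsetDetector *det = ucsdet_open(&status);
      int found = 0;
      if (U_FAILURE(status)) return 0;
      ucsdet_setText(det, data, len, &status);
      const UCharsetMatch *match = ucsdet_detect(det, &status);
      if (match != NULL && U_SUCCESS(status)) {
          const char *n = ucsdet_getName(match, &status);
          if (n != NULL && U_SUCCESS(status)) {
              /* copy before closing; the detector owns the match */
              strncpy(name, n, name_size - 1);
              name[name_size - 1] = '\0';
              found = 1;
          }
      }
      ucsdet_close(det);
      return found;
  }

The same header also offers ucsdet_getConfidence() and ucsdet_detectAll()
if a single best guess isn't enough.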


JK


Re: Mozilla Charset Detectors

2017-05-22 Thread Henri Sivonen
On Mon, May 22, 2017 at 12:13 PM, Gabriel Sandor wrote:
> I recently came across the Mozilla Charset Detectors tool, at
> https://www-archive.mozilla.org/projects/intl/chardet.html. I'm working on
> a C# project where I could use a port of this library (e.g.
> https://github.com/errepi/ude) for advanced charset detection.

It's somewhat unfortunate that chardet got ported over to languages
like Python and C# with its shortcomings. The main shortcoming is that
despite the name saying "universal", the detector was rather arbitrary
in what it detected and what it didn't. Why Hebrew and Thai but not
Arabic or Vietnamese? Why have a Hungarian-specific frequency model
(that didn't actually work) but no models for e.g. Polish and Czech
from the same legacy encoding family?

The remaining detector bits in Firefox are for Japanese, Russian and
Ukrainian only, and I strongly suspect that the Russian and Ukrainian
detectors are doing more harm than good.

> I'm not sure however if this tool is deprecated or not, and still
> recommended by Mozilla for use in modern applications. The tool page is
> archived and most of the links are dead, while the code seems to be at
> least 7-8 years old. Could you please tell me what's the status of this
> tool and whether I should use it in my project or look for something else?

I recommend not using it. (I removed most of it from Firefox.)

I recommend avoiding heuristic detection unless your project
absolutely can't do without. If you *really* need a detector, ICU and
https://github.com/google/compact_enc_det/ might be worth looking at,
though this shouldn't be read as an endorsement of either.

With both ICU and https://github.com/google/compact_enc_det/ , watch
out: the detector's guess space may include very rarely used encodings
that you really don't want content misdetected as.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Mozilla Charset Detectors

2017-05-23 Thread Gabriel Sandor
Hello Henri,

I was afraid this might be the case, so the library really is deprecated.

The project I'm working on involves a multilingual environment, users, and
files, so yes, having a good encoding detector is important. Thanks for the
alternative recommendations; I see that they are C/C++ libraries, but in
theory they can be wrapped in a managed C++/CLI assembly and consumed by
a C# project. I haven't yet seen any existing C# ports that also handle
charset detection.
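
As an alternative to a full C++/CLI wrapper, a thin flat-C shim over such
a library could also be consumed from C# via P/Invoke. A rough sketch,
reusing the hypothetical guess_charset() helper from the ICU sketch
earlier in this thread:

  /* detect_shim.c -- build as a shared library (detect_shim.dll / .so).
   * A C# caller could then declare it roughly as:
   *   [DllImport("detect_shim")]
   *   static extern int DetectCharset(byte[] data, int len,
   *                                   StringBuilder name, int nameSize);
   */
  #include <stddef.h>
  #include <stdint.h>

  /* hypothetical ICU-based helper sketched earlier in the thread */
  int guess_charset(const char *data, int32_t len, char *name, size_t name_size);

  #if defined(_WIN32)
  #  define EXPORT __declspec(dllexport)
  #else
  #  define EXPORT
  #endif

  EXPORT int DetectCharset(const char *data, int32_t len,
                           char *name, int32_t name_size) {
      if (data == NULL || name == NULL || name_size <= 0) return 0;
      return guess_charset(data, len, name, (size_t)name_size);
  }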



Re: Mozilla Charset Detectors

2017-05-23 Thread Joshua Cranmer 🐧

On 5/23/17 2:58 AM, Gabriel Sandor wrote:

> The project I'm working on involves a multilingual environment, users, and
> files, so yes, having a good encoding detector is important.

You only need charset detection if you can't get reliable charsets 
passed around. Most word processing formats embed the charset they use 
in the document (or just use UTF-8 unconditionally), so you only need 
charset detection if you're getting lots of multilingual plain text (or 
plain text-ish formats like markdown or HTML), and even then, only if 
you expect the charset information to be unreliable. It's also worth 
pointing out that letting users override the charset information on a 
per-file basis goes a very long way to avoiding the need for charset 
detection.
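
In code form, the precedence that falls out of that advice might look
roughly like this (purely illustrative; the struct and field names are
assumptions, not an existing API):

  #include <stddef.h>

  /* One possible order of precedence for deciding a file's charset. */
  typedef struct {
      const char *user_override; /* the user explicitly picked a charset for this file */
      const char *declared;      /* embedded in the format, e.g. XML declaration, HTML <meta> */
  } FileCharsetInfo;

  const char *resolve_charset(const FileCharsetInfo *info) {
      if (info->user_override != NULL) return info->user_override; /* user beats everything */
      if (info->declared != NULL)      return info->declared;      /* trust the document */
      return NULL; /* only now consider heuristic detection, or just assume UTF-8 */
  }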


--
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist



Re: Mozilla Charset Detectors

2017-05-25 Thread gabi . t . sandor
On Tuesday, May 23, 2017 at 7:47:12 PM UTC+3, Joshua Cranmer 🐧 wrote:
> You only need charset detection if you can't get reliable charsets
> passed around.

There are cases when I'm dealing with files that don't explicitly mention the
charset. Think of XML files without the "encoding" attribute in the declaration
or HTML files without the meta charset tag. Or plain text files in arbitrary
languages. Many of these are not UTF-8. So there are indeed situations where
heuristic encoding detection is needed.


Re: Mozilla Charset Detectors

2017-05-26 Thread Henri Sivonen
On Thu, May 25, 2017 at 10:44 PM, Gabriel Sandor wrote:
>  Think of XML files without the "encoding" attribute in the declaration or 
> HTML files without the meta charset tag.

Per spec, these must be treated as UTF-16 if there's a UTF-16 BOM and
as UTF-8 otherwise. It's highly inappropriate to run heuristic
detection for XML.
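
A sketch of that rule in C (it only covers the case discussed here; the
XML spec has a few additional detection rules):

  #include <stddef.h>

  /* Encoding to assume for XML with no encoding declaration:
   * UTF-16 if the data starts with a UTF-16 BOM, UTF-8 otherwise. */
  const char *xml_default_encoding(const unsigned char *data, size_t len) {
      if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE) return "UTF-16LE";
      if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF) return "UTF-16BE";
      return "UTF-8";
  }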

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Mozilla Charset Detectors

2017-05-26 Thread gabi . t . sandor
On Friday, May 26, 2017 at 10:01:18 AM UTC+3, Henri Sivonen wrote:
> Per spec, these must be treated as UTF-16 if there's a UTF-16 BOM and
> as UTF-8 otherwise. It's highly inappropriate to run heuristic
> detection for XML.

Still, sometimes XML fragments come up, and even if they are not 100% XML
spec compliant I still have to process them. This includes encoding detection
as well, when the XML declaration is missing from the fragments.


Re: Mozilla Charset Detectors

2017-05-26 Thread Daniel Veditz
On Fri, May 26, 2017 at 4:12 AM, Gabriel Sandor wrote:

> Still, sometimes XML fragments come up and even if they are not 100% XML
> spec compliant i still have to process them. This includes encoding
> detection as well, when the XML declaration is missing from the fragments.
>

Where do the fragments come from? If you pulled them out of a document
then you should have a charset (even if we have to guess at the document
level). If you only get the fragments through an API the charset should be
passed along as an argument to the API, otherwise treat them as Henri
described above.

-Dan Veditz


Re: Mozilla Charset Detectors

2017-05-30 Thread Gabriel Sandor
They can come from arbitrary sources that are out of my control. Therefore
I may not get the charset of the original document, so all I'm left with is
heuristic detection for those fragments. The application must be able to
deal with any XML it receives; it doesn't impose any particular structure
or content (think of XML editors like Notepad++).

Besides XML, there are also plain text files, which don't have any standard
way of declaring their encoding.

No matter how much I'd like to avoid it, there are cases where heuristic
encoding detection is the only option.

___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform