Re: Mozilla Charset Detectors
On 22/05/2017 10:13, Gabriel Sandor wrote:
> Greetings,
>
> I recently came across the Mozilla Charset Detectors tool, at
> https://www-archive.mozilla.org/projects/intl/chardet.html. I'm working on
> a C# project where I could use a port of this library (e.g.
> https://github.com/errepi/ude) for advanced charset detection.
>
> I'm not sure, however, whether this tool is deprecated or still
> recommended by Mozilla for use in modern applications. The tool page is
> archived and most of the links are dead, while the code seems to be at
> least 7-8 years old. Could you please tell me what the status of this
> tool is and whether I should use it in my project or look for something
> else?

I'd suggest looking at ICU, a modern, actively maintained library that
can probably help you:
http://userguide.icu-project.org/conversion/detection

Or there's also https://github.com/google/compact_enc_det (as mentioned
in the ICU documentation).

JK

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
Re: Mozilla Charset Detectors
On Mon, May 22, 2017 at 12:13 PM, Gabriel Sandor wrote:
> I recently came across the Mozilla Charset Detectors tool, at
> https://www-archive.mozilla.org/projects/intl/chardet.html. I'm working on
> a C# project where I could use a port of this library (e.g.
> https://github.com/errepi/ude) for advanced charset detection.

It's somewhat unfortunate that chardet got ported over to languages like
Python and C# with its shortcomings. The main shortcoming is that despite
the name saying "universal", the detector was rather arbitrary in what it
detected and what it didn't. Why Hebrew and Thai but not Arabic or
Vietnamese? Why have a Hungarian-specific frequency model (that didn't
actually work) but no models for e.g. Polish and Czech from the same
legacy encoding family?

The remaining detector bits in Firefox are for Japanese, Russian and
Ukrainian only, and I strongly suspect that the Russian and Ukrainian
detectors are doing more harm than good.

> I'm not sure however if this tool is deprecated or not, and still
> recommended by Mozilla for use in modern applications. The tool page is
> archived and most of the links are dead, while the code seems to be at
> least 7-8 years old. Could you please tell me what's the status of this
> tool and whether I should use it in my project or look for something else?

I recommend not using it. (I removed most of it from Firefox.)

I recommend avoiding heuristic detection unless your project absolutely
can't do without it. If you *really* need a detector, ICU and
https://github.com/google/compact_enc_det/ might be worth looking at,
though this shouldn't be read as an endorsement of either.

With both ICU and https://github.com/google/compact_enc_det/ , watch out
for the detector's possible guess space containing very rarely used
encodings that you really don't want content detected as by mistake.
--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
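Henri's advice to exhaust non-heuristic options first can be made concrete: a BOM identifies the encoding exactly, and bytes that validate as UTF-8 are almost certainly UTF-8. A minimal Python sketch of those deterministic checks (illustrative only; the function name is an assumption, and none of this code is from the thread):

```python
import codecs

def detect_deterministic(data: bytes):
    """Return an encoding name if it can be determined without heuristics,
    or None if only a heuristic detector could make a guess."""
    # UTF-32 BOMs must be tested before UTF-16: the UTF-32-LE BOM
    # begins with the same two bytes as the UTF-16-LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None  # only now is a heuristic detector worth consulting
```

Only when this returns None is there any call for reaching for ICU or compact_enc_det.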
Re: Mozilla Charset Detectors
Hello Henri,

I was afraid this might be the case, so the library really is deprecated.

The project I'm working on involves a multilingual environment, users, and
files, so yes, having a good encoding detector is important. Thanks for
the alternate recommendations. I see that they are C/C++ libraries, but in
theory they can be wrapped in a managed C++ .NET assembly and consumed by
a C# project. I haven't yet seen any existing C# ports that also handle
charset detection.

On Mon, May 22, 2017 at 5:49 PM, Henri Sivonen wrote:
> I recommend not using it. (I removed most of it from Firefox.)
>
> I recommend avoiding heuristic detection unless your project absolutely
> can't do without it. If you *really* need a detector, ICU and
> https://github.com/google/compact_enc_det/ might be worth looking at,
> though this shouldn't be read as an endorsement of either.
>
> With both ICU and https://github.com/google/compact_enc_det/ , watch
> out for the detector's possible guess space containing very rarely
> used encodings that you really don't want content detected as by
> mistake.
Re: Mozilla Charset Detectors
On 5/23/17 2:58 AM, Gabriel Sandor wrote:
> Hello Henri,
>
> I was afraid this might be the case, so the library really is deprecated.
>
> The project I'm working on involves a multilingual environment, users,
> and files, so yes, having a good encoding detector is important. Thanks
> for the alternate recommendations. I see that they are C/C++ libraries,
> but in theory they can be wrapped in a managed C++ .NET assembly and
> consumed by a C# project. I haven't yet seen any existing C# ports that
> also handle charset detection.

You only need charset detection if you can't get reliable charsets passed
around. Most word processing formats embed the charset they use in the
document (or just use UTF-8 unconditionally), so you only need charset
detection if you're getting lots of multilingual plain text (or plain
text-ish formats like Markdown or HTML), and even then, only if you expect
the charset information to be unreliable. It's also worth pointing out
that letting users override the charset information on a per-file basis
goes a very long way toward avoiding the need for charset detection.

--
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist
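The ordering Joshua describes can be sketched as a small helper: a per-file user override wins, then any embedded declaration is trusted, and only as a last resort does a fixed default apply. (A Python sketch; the function name and regexes are illustrative assumptions, not anything from the thread, and real declaration parsing handles far more edge cases.)

```python
import re

# Crude matchers for an XML declaration's encoding attribute and an
# HTML <meta charset=...>; real parsers are considerably more thorough.
XMLDECL_RE = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')
META_RE = re.compile(rb'<meta[^>]*charset=["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)

def choose_encoding(head: bytes, user_override=None) -> str:
    if user_override:          # a per-file user override beats everything
        return user_override
    m = XMLDECL_RE.search(head) or META_RE.search(head)
    if m:                      # trust the embedded declaration
        return m.group(1).decode("ascii")
    return "utf-8"             # last resort: a fixed default, not a heuristic
```

Only inputs that fall through to the default would ever be candidates for heuristic detection.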
Re: Mozilla Charset Detectors
On Tuesday, May 23, 2017 at 7:47:12 PM UTC+3, Joshua Cranmer 🐧 wrote:
> You only need charset detection if you can't get reliable charsets
> passed around. Most word processing formats embed the charset they use
> in the document (or just use UTF-8 unconditionally), so you only need
> charset detection if you're getting lots of multilingual plain text (or
> plain text-ish formats like Markdown or HTML), and even then, only if
> you expect the charset information to be unreliable. It's also worth
> pointing out that letting users override the charset information on a
> per-file basis goes a very long way toward avoiding the need for
> charset detection.

There are cases where I'm dealing with files that don't explicitly mention
the charset. Think of XML files without the "encoding" attribute in the
declaration, or HTML files without the meta charset tag. Or plain text
files in arbitrary languages. Many of these are not UTF-8. So there are
indeed situations where heuristic encoding detection is needed.
Re: Mozilla Charset Detectors
On Thu, May 25, 2017 at 10:44 PM, wrote:
> Think of XML files without the "encoding" attribute in the declaration or
> HTML files without the meta charset tag.

Per spec, these must be treated as UTF-16 if there's a UTF-16 BOM and as
UTF-8 otherwise. It's highly inappropriate to run heuristic detection for
XML.

--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
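The rule Henri cites is simple enough to implement directly. A Python sketch (the function name is an illustrative assumption, not from any spec or library):

```python
import codecs

def decode_undeclared_xml(data: bytes) -> str:
    """Decode XML that carries no encoding declaration: UTF-16 if a
    UTF-16 BOM is present, UTF-8 otherwise -- no heuristics involved."""
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return data.decode("utf-16")  # the utf-16 codec consumes the BOM
    return data.decode("utf-8")
```

A hard failure on invalid input (here, a UnicodeDecodeError) is the spec-conformant outcome; silently guessing some other encoding is not.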
Re: Mozilla Charset Detectors
On Friday, May 26, 2017 at 10:01:18 AM UTC+3, Henri Sivonen wrote:
> > Think of XML files without the "encoding" attribute in the declaration or
> > HTML files without the meta charset tag.
>
> Per spec, these must be treated as UTF-16 if there's a UTF-16 BOM and
> as UTF-8 otherwise. It's highly inappropriate to run heuristic
> detection for XML.

Still, sometimes XML fragments come up, and even if they are not 100% XML
spec compliant, I still have to process them. This includes encoding
detection as well, when the XML declaration is missing from the fragments.
Re: Mozilla Charset Detectors
On Fri, May 26, 2017 at 4:12 AM, wrote:
> Still, sometimes XML fragments come up, and even if they are not 100% XML
> spec compliant, I still have to process them. This includes encoding
> detection as well, when the XML declaration is missing from the fragments.

Where do the fragments come from? If you pulled them out of a document,
then you should have a charset (even if we have to guess at the document
level). If you only get the fragments through an API, the charset should
be passed along as an argument to the API; otherwise, treat them as Henri
described above.

-Dan Veditz
Re: Mozilla Charset Detectors
They can come from arbitrary sources that are out of my control, so I may
not get the charset of the original document; all I'm left with is
heuristic detection for those fragments. The application must be able to
deal with any XML it receives; it doesn't impose any particular structure
or content (think of XML editors like Notepad++). Besides XML, there are
also plain text files, which don't really have a standard way of declaring
their encoding. No matter how much I'd like to avoid it, there are cases
where heuristic encoding detection is the only option.

On Fri, May 26, 2017 at 9:45 PM, Daniel Veditz wrote:
> Where do the fragments come from? If you pulled them out of a document
> then you should have a charset (even if we have to guess at the document
> level). If you only get the fragments through an API the charset should
> be passed along as an argument to the API, otherwise treat them as Henri
> described above.
>
> -Dan Veditz
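If heuristic detection genuinely is unavoidable, Henri's earlier warning about the guess space still applies: constrain whatever the detector returns to encodings the application actually expects. A hypothetical post-filter (the allowlist, names, and fallback are illustrative assumptions, not from the thread; it would wrap the raw output of ICU, compact_enc_det, or a chardet port):

```python
# Encodings this hypothetical application is prepared to handle.
EXPECTED = {"utf-8", "utf-16-le", "utf-16-be", "windows-1252", "windows-1251"}

def constrain_guess(detector_guess: str, fallback: str = "windows-1252") -> str:
    """Map a detector's raw guess onto the expected set so content is
    never mislabeled as a rarely used encoding by mistake."""
    guess = detector_guess.lower().replace("_", "-")
    return guess if guess in EXPECTED else fallback
```

A guess like IBM855 or MacCyrillic would thus never leak through, even if the underlying detector emits it.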