[ark] [Bug 378904] Ark should use charset auto-detection for filenames

Alexander Trufanov Mon, 22 Apr 2019 04:33:39 -0700

https://bugs.kde.org/show_bug.cgi?id=378904


Alexander Trufanov <trufano...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |trufano...@gmail.com

--- Comment #3 from Alexander Trufanov <trufano...@gmail.com> ---
As I found out, this is a very old problem with roots in the ZIP specification.
A ZIP archive can contain non-UTF filenames, UTF-8 filenames, or non-UTF
filenames with an additional field that contains the UTF-8 filename (since
2007). The same applies to ZIP archive comments.
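
To illustrate the third case, here is a minimal Python sketch of reading that
additional field. It assumes the field meant here is the Info-ZIP Unicode Path
extra field (header ID 0x7075: one version byte, a CRC-32 of the legacy name,
then the UTF-8 name). The function name unicode_path_from_extra is
hypothetical and is not taken from Ark or from the unzip patch.

```python
import struct
import zlib
from typing import Optional

UNICODE_PATH_ID = 0x7075  # assumed: Info-ZIP Unicode Path extra field


def unicode_path_from_extra(extra: bytes, raw_name: bytes) -> Optional[str]:
    """Return the UTF-8 name stored in the extra data, or None if absent.

    `extra` is the raw extra-field blob of one entry, `raw_name` is the
    legacy (non-UTF) filename exactly as stored in the entry header.
    """
    pos = 0
    while pos + 4 <= len(extra):
        # Every extra record starts with a 2-byte ID and a 2-byte size.
        header_id, size = struct.unpack_from("<HH", extra, pos)
        body = extra[pos + 4 : pos + 4 + size]
        pos += 4 + size
        if header_id != UNICODE_PATH_ID or len(body) < 5:
            continue
        version, name_crc = struct.unpack_from("<BL", body, 0)
        # The CRC ties the UTF-8 name to the legacy name; a mismatch means
        # the entry was renamed by a tool that did not update this field.
        if version == 1 and name_crc == zlib.crc32(raw_name):
            return body[5:].decode("utf-8")
    return None
```

A caller would fall back to the legacy name whenever this returns None.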

The problem is that, by design, the non-UTF charset is IBM 437, which does not
support non-Western languages.
In practice, Windows encodes filenames with one of its DOS charsets (CP*); for
example, for Russian it will be CP866 (IBM 866). And there is no field in ZIP to
specify exactly which charset was used.
Even worse, many Windows archivers by default use this DOS encoding rather than
UTF-8.
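
A small Python sketch of what this means for a reader. It relies on two things:
general purpose flag bit 11 (0x800) marks a name as UTF-8, and Python's zipfile
module decodes every other name as CP437 by default. The helper name
decode_name and the legacy_encoding parameter are hypothetical; the real
charset has to come from somewhere outside the archive.

```python
UTF8_FLAG = 0x800  # general purpose bit 11: filename and comment are UTF-8


def decode_name(filename: str, flag_bits: int, legacy_encoding: str) -> str:
    """Re-decode a name as returned by zipfile.ZipInfo.filename.

    For entries without the UTF-8 flag, zipfile has decoded the stored
    bytes as CP437. Encoding back to CP437 recovers the original bytes,
    which are then decoded with the charset the creator really used.
    """
    if flag_bits & UTF8_FLAG:
        return filename  # already correct, nothing to guess
    return filename.encode("cp437").decode(legacy_encoding)


# What a CP866 name looks like after the default CP437 decoding:
stored = "Привет.txt".encode("cp866")
as_seen = stored.decode("cp437")          # mojibake shown to the user
fixed = decode_name(as_seen, 0, "cp866")  # readable again
```

The round trip through CP437 is lossless because Python's cp437 codec maps all
256 byte values, so no information is lost by the wrong first decoding.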

As I understand it, the ZIP authors don't want to fix this and suggest that
everyone on non-English systems switch to UTF-8.

Developers proposed several different patches, libraries, and tools to work
around the problem a decade ago.

Maintainers of some Linux distributions also patch the zip/unzip tools in their
systems to work around it. For example, here is a discussion about an unzip
patch for
Ubuntu systems: https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/580961/

It ended up with a patch that was accepted into the Ubuntu main branch, but
this took years.
I think this is a mirror of that patch:
https://github.com/zip-i18n/unzip/blob/master/debian/patches/20-unzip60-alt-iconv-utf8

As far as I can see from the code, they request the locale from the system and
try to match it to a DOS charset using a hardcoded table. Additionally, they
provide command-line arguments that let the user specify the filename encoding
manually. I would say their predefined encoding list is rather small and
oriented toward Russian speakers. Or perhaps that's the wrong patch.
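
The approach described above can be sketched in Python roughly as follows. This
is not the table from the patch: the entries in LOCALE_TO_OEM are only an
illustrative selection of DOS (OEM) code pages, and the names
guess_oem_encoding and override are hypothetical.

```python
from typing import Optional

# Illustrative mapping from locale language to the DOS (OEM) code page
# that a Windows system in that language would typically use.
LOCALE_TO_OEM = {
    "ru": "cp866", "uk": "cp866", "be": "cp866",
    "pl": "cp852", "cs": "cp852", "hu": "cp852",
    "el": "cp737", "tr": "cp857", "he": "cp862",
    "ja": "cp932", "ko": "cp949",
}
DEFAULT_OEM = "cp437"  # the charset the ZIP specification assumes


def guess_oem_encoding(locale_name: str, override: Optional[str] = None) -> str:
    """Pick a filename encoding from a locale string such as 'ru_RU.UTF-8'.

    `override` plays the role of the command-line argument: when the user
    names an encoding explicitly, the table is not consulted at all.
    """
    if override:
        return override
    # 'ru_RU.UTF-8' -> 'ru_RU' -> 'ru'; also drops modifiers like '@latin'.
    language = locale_name.split(".")[0].split("@")[0].split("_")[0].lower()
    return LOCALE_TO_OEM.get(language, DEFAULT_OEM)
```

For example, the locale string could be taken from os.environ.get("LANG", "C"),
and the result passed to whatever code decodes the stored names.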

Anyway, no GUI archiver has implemented anything like that yet.

I don't believe much in automatic encoding detection, at least unless one bets
on the assumption that all non-UTF encodings coming from Windows are CPxxx and
not Windows-12xx. Even for Russian alone there are 4-5 charsets, and some of
them are very hard to distinguish without a dictionary or text statistics. So
it may work as a heuristic, but not as a 100% reliable method.
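
A quick Python sketch of why this is hard. The candidate list below is an
assumption (four common Cyrillic charsets supported by Python), and decode_all
is a hypothetical helper. The point is that a successful decode proves nothing:
the same bytes decode without error under every candidate.

```python
CANDIDATES = ["cp866", "cp1251", "koi8_r", "iso8859_5"]


def decode_all(raw: bytes) -> dict:
    """Decode `raw` with every candidate charset that accepts it."""
    results = {}
    for encoding in CANDIDATES:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            pass  # this charset can be ruled out for these bytes
    return results


raw = "Привет".encode("cp866")   # a name as a Russian Windows would store it
results = decode_all(raw)
for encoding, text in results.items():
    print(encoding, repr(text))
```

None of the candidates is ruled out here, so choosing among them requires
knowledge about the language (letter frequencies or a dictionary), not just
the bytes.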

But I think Ark could do something like what Ubuntu's unzip does:

1. A small prebuilt table that maps the current locale to the encoding expected
in Windows-created ZIPs (like here:
https://github.com/zip-i18n/unzip/blob/master/debian/patches/20-unzip60-alt-iconv-utf8#L36),
on the assumption that the Linux and Windows users speak the same language.

2. Ark could copy the nice menu from Kate (Tools/Encodings) that lets the user
switch in the GUI to any of the encodings available on the system, and use this
choice to display filenames and archive comments in the GUI as well as for I/O
operations while extracting files. This would allow the user to find the proper
charset and get the files extracted.
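
As a rough illustration of point 2, here is a Python sketch of what the menu
choice would drive. It uses Python's zipfile module rather than Ark's own
backends, and list_archive and its encoding parameter are hypothetical names.
The user-selected encoding is applied to every name that is not flagged as
UTF-8 and to the archive comment.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name is stored as UTF-8


def list_archive(fileobj, encoding: str):
    """Return (names, comment) decoded with the user-selected encoding."""
    names = []
    with zipfile.ZipFile(fileobj) as zf:
        for info in zf.infolist():
            if info.flag_bits & UTF8_FLAG:
                names.append(info.filename)  # no guessing needed
            else:
                # zipfile decoded the stored bytes as CP437; undo that and
                # apply the user's choice. errors="replace" keeps the list
                # displayable even when the chosen charset is wrong.
                raw = info.filename.encode("cp437")
                names.append(raw.decode(encoding, errors="replace"))
        # The archive comment is exposed as raw bytes.
        comment = zf.comment.decode(encoding, errors="replace")
    return names, comment
```

Each time the user picks another entry in the menu, the listing would be
rebuilt with the new encoding until the names look right; the same decoded
names would then be used as output paths during extraction.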

-- 
You are receiving this mail because:
You are watching all bug changes.