https://bugs.documentfoundation.org/show_bug.cgi?id=88821

            Bug ID: 88821
           Summary: Import HTML file into Calc wrongly assumes source
                    code-page is Windows-1252
           Product: LibreOffice
           Version: 4.3.5.2 release
          Hardware: Other
                OS: All
            Status: UNCONFIRMED
          Severity: normal
          Priority: medium
         Component: Spreadsheet
          Assignee: libreoffice-bugs@lists.freedesktop.org
          Reporter: grof...@hotmail.com

Created attachment 112821
  --> https://bugs.documentfoundation.org/attachment.cgi?id=112821&action=edit
slovenian_utf-8.csv

In LibreOffice Calc 4.3.5.2 on Windows 7, Calc wrongly assumes that an HTML
input file is ALWAYS in the Windows-1252 code page.

MY SETTINGS:
1. Tools | Options | Language Settings | Languages
a) User interface: English (USA)
b) Locale settings: Slovenian
See attachment Tools_Options_Language_settings.png


TEST 1:
1. Start Calc.
2. File | Open
3. Select slovenian_utf-8.csv
4. Text Import dialog opens.
4a. Character set: "Unicode (UTF-8)".
4b. Language selection: Slovenian.
4c. Separated by: Semicolon
4d. Check: Quoted field as text and check Detect special numbers.
4e. Click on Open button.
Result: WORKS FINE. Non-English characters in the Text field are correctly
decoded as UTF-8 (setting 4a), the decimal separator is correctly recognized
as a comma (setting 4b), and the Date field (setting 4b) is correctly
recognized as a date.


TEST 2:
The same as Test 1 except:
3. Select slovenian_windows-1250.csv
4a.  Character set: "Eastern Europe Windows-1250".
Result: WORKS FINE. Non-English characters in Text field are correctly
recognized as Windows-1250 code page.


TEST 3:
The same as Test 1 except:
3. Select slovenian_utf-8.html
4. The source of the problem is probably "Import Options" dialog. There is no
option to select character set. The only option is language selection. I
selected Custom: Slovenian.
Result: PROBLEM. Non-English characters in A2 field are corrupted.

If the same file is opened (File | Open File) in the Firefox 35 web browser,
the non-English characters are displayed correctly: the file contains the HTML
tag meta charset=UTF-8, so the browser recognizes the character set. But if you
now select View | Character Encoding | Western in Firefox, you get the same
corrupted non-English text as in LibreOffice. So LibreOffice just assumes
that the input HTML file is ALWAYS in the Windows-1252 (Western) code page.


TEST 4:
The same as Test 1 except:
3. Select: slovenian_windows-1250.html
4. The same problem as in Test 3, the "Import options" dialog does not contain
a character set option to select from. So I just selected Custom: Slovenian.
Result: PROBLEM. Non-English characters in A2 field are corrupted.

If the same file is opened in the Firefox 25 web browser, the non-English
characters are correctly recognized as Windows-1250 code page (because of the
HTML meta charset tag). Now change the encoding with View | Character Encoding |
Western and you get the same corrupted non-English characters as in
LibreOffice. So LibreOffice just assumes that the input HTML file is ALWAYS in
the Windows-1252 (Western) code page.
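
The corruption seen in Tests 3 and 4 can be reproduced outside LibreOffice.
The following is only a minimal Python sketch of the suspected mechanism,
not LibreOffice code: the helper name misdecode and the sample characters
are assumptions made for illustration, and the attached files are not
reproduced here. It encodes a character in the file's real code page and
then decodes the bytes as Windows-1252.

```python
def misdecode(text: str, actual: str, assumed: str = "cp1252") -> str:
    """Encode text in its real code page, then decode it with the assumed one.

    Hypothetical helper for illustration only. Bytes that Python's cp1252
    codec leaves undefined (such as 0x8D) are shown as U+FFFD.
    """
    return text.encode(actual).decode(assumed, errors="replace")


if __name__ == "__main__":
    for char in "čšž":
        print(char,
              "| UTF-8 read as 1252:", repr(misdecode(char, "utf-8")),
              "| 1250 read as 1252:", repr(misdecode(char, "cp1250")))
```

With UTF-8 input every non-ASCII character turns into two wrong characters.
With Windows-1250 input only some characters change (for example "č" becomes
"è"), because "š" and "ž" sit at the same byte positions in both code pages.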


In my humble opinion, the problem with importing HTML files is that LibreOffice
assumes the source HTML file is ALWAYS encoded in the Western (Windows-1252)
code page.


How the problem should be fixed:

Quick fix: According to statistics (see e.g. https://en.wikipedia.org/wiki/UTF-8),
most web pages nowadays are UTF-8 encoded. UTF-8 is also a universal encoding
for ANY language in the world. Set the default code page of the Calc import
filter to UTF-8.


Permanent fix: Create the same import dialog for HTML files as the one used
when importing CSV files (or at least add a "Character set" option). In that
case, please check the meta charset tag in the HTML file and preselect the
code page it declares, if the tag exists. If the meta tag does not exist,
default to UTF-8.
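
The detection step proposed above could look roughly like the following
Python sketch. It is only an illustration of the idea, not LibreOffice's
import filter: the function names sniff_html_charset and decode_html, the
regular expression, and the 1024-byte scan window are all assumptions. It
looks for a charset declaration in a meta tag and falls back to UTF-8 when
none is found or when the declared name is unknown.

```python
import codecs
import re

# Matches both <meta charset=UTF-8> and
# <meta http-equiv="Content-Type" content="text/html; charset=windows-1250">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_html_charset(data: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in a meta tag, or the default.

    Only the first 1024 bytes are scanned (an assumed limit).
    """
    match = _META_CHARSET.search(data[:1024])
    if match is None:
        return default
    name = match.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # reject names Python has no codec for
    except LookupError:
        return default
    return name


def decode_html(data: bytes, default: str = "utf-8") -> str:
    """Decode HTML bytes with the sniffed charset instead of a fixed one."""
    return data.decode(sniff_html_charset(data, default), errors="replace")


if __name__ == "__main__":
    sample = "<meta charset=windows-1250><td>č</td>".encode("cp1250")
    print(sniff_html_charset(sample))
    print(decode_html(sample))
```

In an import dialog, the sniffed value would only preselect the "Character
set" option, so the user could still override it.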
