https://bugs.documentfoundation.org/show_bug.cgi?id=88821
Bug ID: 88821
Summary: Import HTML file into Calc wrongly assumes source code-page is Windows-1252
Product: LibreOffice
Version: 4.3.5.2 release
Hardware: Other
OS: All
Status: UNCONFIRMED
Severity: normal
Priority: medium
Component: Spreadsheet
Assignee: libreoffice-bugs@lists.freedesktop.org
Reporter: grof...@hotmail.com

Created attachment 112821
  --> https://bugs.documentfoundation.org/attachment.cgi?id=112821&action=edit
slovenian_utf-8.csv

In LibreOffice Calc 4.3.5.2 on Windows 7, Calc wrongly assumes that an HTML input file is ALWAYS in the Windows-1252 code page.

MY SETTINGS:
1. Tools | Options | Language Settings | Languages
   a) User interface: English (USA)
   b) Locale settings: Slovenian
See attachment Tools_Options_Language_settings.png

TEST 1:
1. Start Calc.
2. File | Open
3. Select slovenian_utf-8.csv
4. Text Import dialog opens.
4a. Character set: "Unicode (UTF-8)".
4b. Language selection: Slovenian.
4c. Separated by: Semicolon
4d. Check "Quoted field as text" and check "Detect special numbers".
4e. Click the Open button.
Result: WORKS FINE. Non-English characters in the Text field are correctly recognized as UTF-8 (setting 4a), the decimal separator is correctly recognized as a comma (setting 4b), and the Date field is correctly recognized as a date (setting 4b).

TEST 2:
The same as Test 1 except:
3. Select slovenian_windows-1250.csv
4a. Character set: "Eastern Europe Windows-1250".
Result: WORKS FINE. Non-English characters in the Text field are correctly recognized as Windows-1250 code page.

TEST 3:
The same as Test 1 except:
3. Select slovenian_utf-8.html
4. The source of the problem is probably the "Import Options" dialog. There is no option to select a character set. The only option is language selection. I selected Custom: Slovenian.
Result: PROBLEM. Non-English characters in the A2 field are corrupted.
If the same file is opened (File | Open File) in the Firefox 35 web browser, the non-English characters are displayed correctly: the file contains the HTML tag meta charset=UTF-8, so the browser correctly recognizes the character set. But if you now select the Firefox menu View | Character Encoding | Western, you get the same corrupted non-English text as in LibreOffice. So LibreOffice just assumes that the input HTML file is ALWAYS in the Windows-1252 (Western) code page.

TEST 4:
The same as Test 1 except:
3. Select slovenian_windows-1250.html
4. The same problem as in Test 3: the "Import Options" dialog does not contain a character set option to select from. So I just selected Custom: Slovenian.
Result: PROBLEM. Non-English characters in the A2 field are corrupted.
If the same file is opened in the Firefox 25 web browser, the non-English characters are correctly recognized as Windows-1250 code page (because of the HTML meta charset tag). Now change the encoding with View | Character Encoding | Western and you get the same corrupted non-English characters as in LibreOffice. So LibreOffice just assumes that the input HTML file is ALWAYS in the Windows-1252 (Western) code page.

In my humble opinion, the problem with importing an HTML file is that LibreOffice assumes the source HTML file is ALWAYS encoded in the Western (Windows-1252) code page.

How the problem should be fixed:

Quick fix: According to statistics (see e.g. https://en.wikipedia.org/wiki/UTF-8), most web pages nowadays are UTF-8 encoded. UTF-8 is also a universal encoding for ANY language in the world. Set the default code page of the Calc HTML import filter to UTF-8.

Permanent fix: Create the same import dialog for HTML files as the one used for importing CSV files (or at least add a "Character set" option). In this case, please check the meta charset tag in the HTML file and, if it exists, use it as the default code-page selection. If the meta tag does not exist, default to UTF-8.
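
For illustration only, here is a minimal Python sketch of the behaviour requested in the permanent fix above: look for a meta charset declaration near the start of the file and fall back to UTF-8 when none is found. This is not LibreOffice code. The names `sniff_html_charset` and `decode_html`, the regular expression, and the sample markup are all hypothetical, and the sample is not the content of the attached test files. The 1024-byte scan window follows the HTML Standard's requirement that the encoding declaration appear within the first 1024 bytes.

```python
import codecs
import re

# Hypothetical pattern: matches both <meta charset="UTF-8"> and
# <meta http-equiv="Content-Type" content="text/html; charset=windows-1250">.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([\w.:-]+)""",
    re.IGNORECASE,
)


def sniff_html_charset(data: bytes, default: str = "utf-8") -> str:
    """Return the codec named in a <meta> tag, or `default` if none is usable."""
    match = META_CHARSET_RE.search(data[:1024])
    if match:
        label = match.group(1).decode("ascii")
        try:
            # Normalise the label to Python's canonical codec name.
            return codecs.lookup(label).name
        except LookupError:
            pass  # Unknown label: fall through to the default.
    return default


def decode_html(data: bytes) -> str:
    """Decode HTML bytes using the declared charset, defaulting to UTF-8."""
    return data.decode(sniff_html_charset(data), errors="replace")


# Hypothetical sample markup with three Slovenian letters.
sample = '<html><head><meta charset="UTF-8"></head><body>čšž</body></html>'
utf8_bytes = sample.encode("utf-8")
cp1250_bytes = sample.replace("UTF-8", "windows-1250").encode("cp1250")

if __name__ == "__main__":
    for raw in (utf8_bytes, cp1250_bytes):
        print(sniff_html_charset(raw), decode_html(raw))
    # Decoding the same bytes as Windows-1252, regardless of the declaration,
    # does not give back the original letters.
    print(utf8_bytes.decode("cp1252", errors="replace"))
    print(cp1250_bytes.decode("cp1252", errors="replace"))
```

In this sketch both sample byte strings decode back to the original letters when the declared charset is honoured, while forcing Windows-1252 on them does not, which is consistent with the corruption seen in Tests 3 and 4. A real import filter would probably also need to handle a byte order mark and let the user override the detected value in the dialog.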