https://bugs.documentfoundation.org/show_bug.cgi?id=147914

Bug ID: 147914
Summary: File over-read parsing XLS with mixed wide- and narrow-character strings
Product: LibreOffice
Version: 7.3.1.3 release
Hardware: All
OS: All
Status: UNCONFIRMED
Severity: normal
Priority: medium
Component: Calc
Assignee: libreoffice-bugs@lists.freedesktop.org
Reporter: rennie.degr...@gmail.com

Description:
The XLS format has a maximum record length of 8224 bytes. The maximum string length is 32767 characters (a character whose UTF-16 representation requires a surrogate pair counts as two characters). Consequently, long strings must be split across multiple records using "continue records" (https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/999fae21-d3d9-42e8-8290-639782460c67).

Strings are represented as "XLUnicodeRichExtendedString" objects (https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/173d9f51-e5d3-43da-8de2-be7f22e119b9). They may use either narrow (8-bit) or wide (UTF-16LE) characters; a flag indicates which encoding a particular string uses. For whatever reason (blame some nameless dev in the 1990s), the flag is repeated in each continue record. Consequently, it is valid for a string to start off using narrow characters and be continued by a wide-character block. Yes, this is perverse.

In order to test some other software that parses XLS, I used Excel to create an XLS with a 32767-character narrow-character string ("aaa....aaa"), then opened it in an OLE compound document hex editor ("Compound File Explorer", though the tool that you use should not matter). My string was split across four records, as expected (in the "Workbook" OLE stream). I changed the narrow/wide character flag byte to 0x01 (indicating wide-character data) on the 2nd and 4th blocks. Since XLS uses UTF-16 for wide characters, this changes the string to "aaa...aaa慡慡慡...慡慡慡aaa...aaa慡慡慡...慡慡慡". However, I did *not* update the string length.
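
The effect of that edit can be sketched in a few lines of Python. This is only an illustration, not taken from the report: the per-record payload sizes in `blocks` are hypothetical (the report does not give the real ones), and `chars_available` is a helper name invented here.

```python
# Narrow (8-bit) character data reinterpreted as UTF-16LE, which is what
# flipping a continue record's flag byte from 0x00 to 0x01 asks a parser to do.
narrow = b"a" * 8
as_wide = narrow.decode("utf-16-le")  # each byte pair 0x61 0x61 becomes U+6161

DECLARED_CHARS = 32767  # string length field, left unchanged in the file

# Hypothetical (payload_bytes, is_wide) split across four records; the real
# block sizes are not stated in the report.
blocks = [(8191, False), (8192, True), (8192, False), (8192, True)]


def chars_available(blocks):
    """Count the characters that the payload bytes can actually supply."""
    return sum(n // 2 if wide else n for n, wide in blocks)


available = chars_available(blocks)
# Characters a reader would still try to fetch if it trusts the declared length.
shortfall = DECLARED_CHARS - available

print(as_wide, available, shortfall)
```

Under these assumed sizes the wide blocks supply only half as many characters as before, so a reader that trusts the declared length without checking the remaining record data would keep reading past the end of the string.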
Since those two blocks are now wide characters but I did not add any additional data, the string should be shorter. This makes the document invalid. Excel goes into recovery mode when trying to load it. However, Calc loads the following string:

aaa...aaa慡慡慡...慡慡慡aaa...aaa慡慡慡...慡慡慡一浡ե?慖畬ť?ɡ?慡愀慡愀慡ա?慡慡ୡ?敄捳楲瑰潩੮?桓牯⁴慮敭 䰀湯慮敭䄀瑬牥慮整搠獥牣灩楴湯?潓敭桴湩

Copying the extraneous data into a text file, saving it as UTF-16LE and opening it in a hex editor reveals 0x76 bytes of file data following the end of the last string block:

    04 00 00 4E 61 6D 65 05 3F 00 56 61 6C 75 65 01
    3F 00 61 02 3F 00 61 61 03 00 00 61 61 61 04 00
    00 61 61 61 61 05 3F 00 61 61 61 61 61 0B 3F 00
    44 65 73 63 72 69 70 74 69 6F 6E 0A 3F 00 53 68
    6F 72 74 20 6E 61 6D 65 09 00 00 4C 6F 6E 67 20
    6E 61 6D 65 15 00 00 41 6C 74 65 72 6E 61 74 65
    20 64 65 73 63 72 69 70 74 69 6F 6E 3F 00 53 6F
    6D 65 74 68 69 6E

I didn't try debugging into Calc to see where/how it got this data. There might be security implications depending on how/where the over-read occurs.

I created a second version of the XLS file in which I corrected the string length. Calc appeared to handle that file correctly.

I tested this using release 7.3.1.3 on Windows 10 amd64. I expect that the same will occur on other platforms and versions since XLS is a rather old format.

Steps to Reproduce:
1. Create a malformed XLS file as described above
2. Open in Calc

Actual Results:
Over-read file data is displayed in the document as described above

Expected Results:
No over-read file data should appear.

Reproducible: Always

User Profile Reset: No

Additional Info:
Version: 7.3.1.3 (x64) / LibreOffice Community
Build ID: a69ca51ded25f3eefd52d7bf9a5fad8c90b87951
CPU threads: 2; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded
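
For anyone triaging this, the 0x76 bytes quoted in the description can be inspected with a short Python sketch. The hex string below is copied from the report; the `[A-Za-z ]{4,}` filter is just an assumed heuristic for picking out readable text, and the variable names are invented here.

```python
import re

# The 0x76 over-read bytes, exactly as quoted in the description.
LEAKED_HEX = """
04 00 00 4E 61 6D 65 05 3F 00 56 61 6C 75 65 01
3F 00 61 02 3F 00 61 61 03 00 00 61 61 61 04 00
00 61 61 61 61 05 3F 00 61 61 61 61 61 0B 3F 00
44 65 73 63 72 69 70 74 69 6F 6E 0A 3F 00 53 68
6F 72 74 20 6E 61 6D 65 09 00 00 4C 6F 6E 67 20
6E 61 6D 65 15 00 00 41 6C 74 65 72 6E 61 74 65
20 64 65 73 63 72 69 70 74 69 6F 6E 3F 00 53 6F
6D 65 74 68 69 6E
"""

# bytes.fromhex skips the whitespace between byte pairs.
leaked = bytes.fromhex(LEAKED_HEX)

# Pull out runs of at least four letters or spaces (a heuristic filter).
runs = [m.group().decode("ascii") for m in re.finditer(rb"[A-Za-z ]{4,}", leaked)]

print(len(leaked))
for run in runs:
    print(repr(run))
```

The readable runs ("Name", "Value", "Description", and so on) look like narrow-character text rather than random data, which would be consistent with the bytes having come from elsewhere in the same file.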