https://bugs.documentfoundation.org/show_bug.cgi?id=147914

            Bug ID: 147914
           Summary: File over-read parsing XLS with mixed wide- and
                    narrow-character strings
           Product: LibreOffice
           Version: 7.3.1.3 release
          Hardware: All
                OS: All
            Status: UNCONFIRMED
          Severity: normal
          Priority: medium
         Component: Calc
          Assignee: libreoffice-bugs@lists.freedesktop.org
          Reporter: rennie.degr...@gmail.com

Description:
The XLS format has a maximum record length of 8224 bytes.  The maximum string
length is 32767 characters (a character whose UTF-16 representation requires a
surrogate pair counts as two characters).  Consequently, long strings must be
split across multiple records using "continue records"
(https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/999fae21-d3d9-42e8-8290-639782460c67).
 

Strings are represented as "XLUnicodeRichExtendedString" objects
(https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/173d9f51-e5d3-43da-8de2-be7f22e119b9).
 They may use either narrow (8-bit) or wide (UTF-16LE) characters; which is
used by a particular string is indicated by a flag.  For whatever reason (blame
some nameless dev in the 1990s), the flag is repeated in each continue record. 
Consequently, it is valid for a string to start off using narrow characters and
be continued by a wide character block.  Yes, this is perverse.
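
As a rough illustration of the per-record flag, the following Python sketch
decodes a string whose character data is split into segments, each carrying its
own narrow/wide flag.  The name `decode_mixed` and the (flag, bytes) segment
layout are hypothetical simplifications, not LibreOffice code and not the exact
on-disk record layout.  Narrow characters are assumed to be the low bytes of
UTF-16 code units, so Latin-1 decoding is used for them.

```python
def decode_mixed(segments):
    """Decode (wide_flag, raw_bytes) segments into one string.

    Hypothetical simplification: each segment stands for the character
    data of one record, with the flag byte already split off.
    """
    parts = []
    for wide, data in segments:
        if wide:
            # Wide segment: two bytes per character, UTF-16LE.
            parts.append(data.decode("utf-16-le"))
        else:
            # Narrow segment: one byte per character (assumed to be the
            # low byte of a UTF-16 code unit, hence Latin-1).
            parts.append(data.decode("latin-1"))
    return "".join(parts)


# A string that starts narrow and is continued by a wide segment.
text = decode_mixed([(0, b"abc"), (1, "de".encode("utf-16-le"))])
print(text)
```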

In order to test some other software that parses XLS, I used Excel to create an
XLS with a 32767-character narrow-character string ("aaa....aaa"), then opened
it up using an OLE compound document hex editor ("Compound File Explorer",
though the tool that you use should not matter).  My string was split across
four records, as expected (in the "Workbook" OLE stream).  I changed the
narrow/wide character flag byte to 0x01 (indicating wide character data) on the
2nd and 4th blocks. Since XLS uses UTF-16 for wide characters, this changes the
string to "aaa...aaa慡慡慡...慡慡慡aaa...aaa慡慡慡...慡慡慡".
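
The effect of flipping the flag can be reproduced in a few lines of Python.
This is only a worked example of the reinterpretation, with an arbitrary 8-byte
run standing in for one block of the real string:

```python
block = b"a" * 8  # stand-in for one block of narrow 'a' characters

as_narrow = block.decode("latin-1")    # flag 0x00: one byte per character
as_wide = block.decode("utf-16-le")    # flag 0x01: two bytes per character

print(as_narrow, len(as_narrow))
print(as_wide, len(as_wide))
# Each pair of 0x61 bytes is read as the single code unit U+6161.
```

The same bytes supply only half as many characters once they are read as wide
data.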

However, I did *not* update the string length.  Since those two blocks are now
wide characters but I did not add any additional data, the string should be
shorter.  This makes the document invalid.  Excel goes into recovery mode when
trying to load it.  However, Calc loads the following string:

aaa...aaa慡慡慡...慡慡慡aaa...aaa慡慡慡...慡慡慡一浡ե?慖畬ť?ɡ?慡愀慡愀慡ա?慡慡ୡ?敄捳楲瑰潩੮?桓牯⁴慮敭       
䰀湯⁧慮敭䄀瑬牥慮整搠獥牣灩楴湯?潓敭桴湩

Copying the extraneous data into a text file, saving it as UTF-16LE and opening
it in a hex editor reveals 0x76 bytes of file data following the end of the
last string block:

04 00 00 4E 61 6D 65 05 3F 00 56 61 6C 75 65 01 3F 00 61 02 3F 00 61 61 03 00
00 61 61 61 04 00 00 61 61 61 61 05 3F 00 61 61 61 61 61 0B 3F 00 44 65 73 63
72 69 70 74 69 6F 6E 0A 3F 00 53 68 6F 72 74 20 6E 61 6D 65 09 00 00 4C 6F 6E
67 20 6E 61 6D 65 15 00 00 41 6C 74 65 72 6E 61 74 65 20 64 65 73 63 72 69 70
74 69 6F 6E 3F 00 53 6F 6D 65 74 68 69 6E
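
The dump can be checked with a short Python sketch that parses the hex bytes,
confirms the 0x76 length, and pulls out runs of letters and spaces.  The
variable names and the four-character minimum run length are arbitrary choices
for illustration:

```python
import re

# The bytes exactly as listed in the dump above.
dump = bytes.fromhex(
    "04 00 00 4E 61 6D 65 05 3F 00 56 61 6C 75 65 01 3F 00 61 02 3F 00 61 61 03 00 "
    "00 61 61 61 04 00 00 61 61 61 61 05 3F 00 61 61 61 61 61 0B 3F 00 44 65 73 63 "
    "72 69 70 74 69 6F 6E 0A 3F 00 53 68 6F 72 74 20 6E 61 6D 65 09 00 00 4C 6F 6E "
    "67 20 6E 61 6D 65 15 00 00 41 6C 74 65 72 6E 61 74 65 20 64 65 73 63 72 69 70 "
    "74 69 6F 6E 3F 00 53 6F 6D 65 74 68 69 6E"
)
print(hex(len(dump)))

# Runs of at least four letters or spaces, read as narrow characters.
runs = [m.decode("ascii") for m in re.findall(rb"[A-Za-z ]{4,}", dump)]
print(runs)
```

The readable runs ("Name", "Value", "Description", "Short name" and so on) are
consistent with this being other data from the file rather than part of the
edited string.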

I didn't try debugging into Calc to see where/how it got this data.  There
might be security implications depending on how/where the over-read occurs.

I created a second version of the XLS file in which I corrected the string
length.  Calc appeared to handle that file correctly.
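
The corrected length follows from simple arithmetic: a narrow block contributes
one character per byte and a wide block one character per two bytes.  Below is a
minimal sketch of that calculation and of a bounds check that a parser could
apply.  The function names and the 8-byte block size are hypothetical, and this
is not a description of how Calc is implemented:

```python
def available_chars(segments):
    """Count the characters that (wide_flag, raw_bytes) segments can supply."""
    return sum(len(data) // (2 if wide else 1) for wide, data in segments)


def check_declared_length(cch, segments):
    """Reject a declared character count that exceeds the stored data."""
    have = available_chars(segments)
    if cch > have:
        raise ValueError(f"string declares {cch} characters, data holds {have}")
    return have


# Four hypothetical 8-byte blocks; the 2nd and 4th are flagged wide.
blocks = [(0, b"a" * 8), (1, b"a" * 8), (0, b"a" * 8), (1, b"a" * 8)]
print(available_chars(blocks))  # 8 + 4 + 8 + 4
```

With all four blocks narrow the count would be 32; once two blocks are flagged
wide, a declared length of 32 exceeds what the data holds and
`check_declared_length` raises instead of reading past the end.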

I tested this using release 7.3.1.3 on Windows 10 amd64.  I expect that the
same will occur on other platforms and versions since XLS is a rather old
format.

Steps to Reproduce:
1. Create a malformed XLS file as described above
2. Open in Calc

Actual Results:
Over-read file data is displayed in the document as described above

Expected Results:
No over-read file data should appear.


Reproducible: Always


User Profile Reset: No



Additional Info:
Version: 7.3.1.3 (x64) / LibreOffice Community
Build ID: a69ca51ded25f3eefd52d7bf9a5fad8c90b87951
CPU threads: 2; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded
