Public bug reported:

In the RTF format, the \u command is used to specify Unicode characters,
and the value of the \uc command specifies how many characters to ignore
after each \u command.  The RTF specification says that the default
value of \uc should be 1.

Some RTF files explicitly set \uc1, even though this is not required as
1 is already the default.  This bug activates on RTF files that set \uc1
explicitly.

In rtfread.c line 265 the loop says while((--i)>0) which means it will
iterate i-1 times, where i has just been set to groups[group_count].uc.
Therefore, if the uc value is 1, this loop will iterate zero times and
no characters will be ignored after the \u command, whereas actually one
fallback character should have been ignored.  This usually manifests
itself as affected RTF files showing a question mark after every non-
ASCII character.
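
The difference between the two loop tests can be checked in isolation.
The following is only a minimal sketch, not code taken from rtfread.c:
the function names are made up for illustration, and the loop body just
counts iterations where the real code would discard a character.

```c
/* Count how many times each loop form runs its body for a given uc.
 * Hypothetical helpers; rtfread.c itself is not reproduced here. */

/* Loop test as quoted in the report: pre-decrement. */
int skipped_with_predecrement(int uc)
{
    int i = uc;
    int skipped = 0;
    while ((--i) > 0)
        skipped++;      /* stands in for discarding one fallback character */
    return skipped;
}

/* Same loop with a post-decrement test. */
int skipped_with_postdecrement(int uc)
{
    int i = uc;
    int skipped = 0;
    while ((i--) > 0)
        skipped++;
    return skipped;
}
```

With uc equal to 1, the pre-decrement form runs its body zero times and
the post-decrement form runs it once.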

Line 205 of the same file sets groups[0].uc = 2 with the comment Default
uc = 2.  This comment is incorrect according to the RTF specification,
which says that the default value of uc is 1.  However, setting it to 2
does work around the fact that the loop starting on line 265 iterates
only i-1 times instead of i times.

Therefore, if the RTF file does not contain any \uc commands, the code
behaves correctly: it initialises uc to 2 rather than the specified
default of 1, and the pre-decrement in the loop test then effectively
subtracts 1 again, so exactly one character is skipped.  But if \uc1 is
set explicitly by the RTF file, then line 253 comes into play, which
says groups[group_count].uc=com.numarg, setting the uc variable to 1
instead of 2, and then the loop on line 265 iterates zero times and the
fallback character gets included.
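
The interaction between the three lines can be modelled in a few lines
of C.  This is a sketch under assumptions: the function and its
parameters are hypothetical, and it only paraphrases the behaviour
described above rather than quoting rtfread.c.

```c
/* Hypothetical model of the uc handling described in the report;
 * the real logic lives in rtfread.c and is only paraphrased here. */
int fallback_chars_skipped(int has_explicit_uc, int numarg)
{
    int uc = 2;                 /* initial value set on line 205 */
    if (has_explicit_uc)
        uc = numarg;            /* assignment on line 253 */

    int i = uc;
    int skipped = 0;
    while ((--i) > 0)           /* loop test on line 265 */
        skipped++;
    return skipped;
}
```

In this model, a file with no \uc command gets one character skipped,
a file with an explicit \uc1 gets none skipped, and a file with \uc2
gets only one skipped instead of two.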

The quickest way to fix this bug would be to add a +1 before the
semicolon at the end of line 253, but I think the code would be clearer
if the two instances of the number 2 on line 205 were changed to 1 and
the loop test on line 265 were changed from while((--i)>0) to
while((i--)>0), so that the uc variable contains the actual value of
the \uc command and the loop iterates the correct number of times.
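
As a rough illustration of the second approach, the sketch below keeps
the real \uc value in the state and skips with a post-decrement test.
Everything here is an assumption made for illustration: the struct and
function names are hypothetical, and each fallback character is treated
as a single byte, which is a simplification of real RTF input.

```c
/* Hypothetical stand-in for the per-group state in rtfread.c. */
struct group_state {
    int uc;   /* number of fallback characters to skip after each \u */
};

/* Default of 1, matching the RTF specification as cited in the report. */
void init_group(struct group_state *g)
{
    g->uc = 1;
}

/* \ucN handler: store the argument unchanged. */
void set_uc(struct group_state *g, int numarg)
{
    g->uc = numarg;
}

/* Skip up to uc fallback characters following a \u keyword, stopping
 * early at the end of the string.
 * Simplification: each fallback character is treated as one byte. */
const char *skip_fallback(const char *p, const struct group_state *g)
{
    int i = g->uc;
    while ((i--) > 0 && *p != '\0')
        p++;
    return p;
}
```

With the default state, skip_fallback("?abc", &g) returns a pointer to
"abc", so the single fallback character is dropped whether or not the
file sets \uc1 explicitly.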

Meanwhile, the bug can be worked around in most cases by deleting any
instance of \uc1 in the input RTF before feeding it to catdoc.  That
workaround applies only for RTF files that never use any \uc value other
than 1.  For such RTF files, it is sufficient to use the command:

sed -e 's/\\uc1//g' < file.rtf | catdoc

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: catdoc 1:0.95-5
ProcVersionSignature: User Name 6.5.0-1027.27~22.04.1-oracle 6.5.13
Uname: Linux 6.5.0-1027-oracle x86_64
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
CasperMD5CheckResult: unknown
CloudArchitecture: x86_64
CloudID: oracle
CloudName: oracle
CloudPlatform: oracle
CloudSubPlatform: metadata (http://169.254.169.254/opc/v2/)
Date: Wed Aug  7 14:36:50 2024
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_GB.UTF8
 SHELL=/bin/bash
SourcePackage: catdoc
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: catdoc (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug jammy uec-images

https://bugs.launchpad.net/bugs/2076244

Title:
  RTF files containing \uc1 show fallback characters after Unicode
