RFC: bug #2742

Tomas Frydrych Sun, 17 Mar 2002 10:43:44 -0800


This bug surfaced in the MS Word importer, but it will plague many 
others. It has to do with handling the so-called mirror characters, 
such as ( ) or [ ].


In Unicode, mirror characters are defined semantically. For 
instance, character u0028 is defined as 'opening parenthesis'. In a 
left-to-right context an opening parenthesis looks like '(', but in a 
right-to-left context it looks like ')'. AW uses this semantic 
definition, and it displays the correct glyph depending on the context.

A problem arises when we import documents from some other 
format that does not use the semantic Unicode definition. For 
instance, MS Word will not use u0028 for an opening parenthesis in 
RTL context; instead it will store u0029 in its place (which is 
'closing parenthesis'). So, when we load the document and analyse 
the context, we will display the glyph for a closing parenthesis in 
RTL context, which is '(', while the author intended ')'.
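
To make the mismatch concrete, here is a minimal Python sketch. It is 
not AW code: the MIRROR table (which lists only a few pairs) and the 
helpers displayed_glyph and import_fixup are hypothetical names used 
for illustration only.

```python
# Hypothetical illustration, not AW code. Only a few mirror pairs are listed.
MIRROR = {"(": ")", ")": "(", "[": "]", "]": "[", "{": "}", "}": "{"}

def displayed_glyph(ch, rtl):
    """Glyph that a semantic (Unicode-style) renderer draws for ch."""
    if rtl and ch in MIRROR:
        return MIRROR[ch]
    return ch

def import_fixup(ch, rtl):
    """Turn a visually stored character into its semantic code point."""
    return MIRROR.get(ch, ch) if rtl else ch

# The foreign format stored u0029 where the author wanted ')' in RTL text.
stored = "\u0029"
naive = displayed_glyph(stored, rtl=True)                          # mirrored to '('
fixed = displayed_glyph(import_fixup(stored, rtl=True), rtl=True)  # back to ')'
```

Swapping the code point once at import time is what lets the 
semantic renderer mirror it back to the glyph the author intended.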

This is a serious bug that needs a fix before the 1.0 release. I 
see two possible avenues:

(1) The MS Word importer carries out the analysis of the context 
and translates any mirror characters in RTL context to the 
correct Unicode values. The problems with this are that (a) the 
importer was not designed to analyse the context; it handles one 
character at a time, and getting it to do this properly would not 
be entirely simple, though it is surmountable; and (b) we would have 
to redo this in every importer that handles a file format with the 
same problem (plain text, etc.).

(2) A second solution would be to add a method to our edit 
methods, which would scan through the document for any mirror 
characters in RTL context and replace them with their mirror 
images. This method would be called by the importer once the 
entire document is loaded. The main advantages of this are (a) it 
can be used by any importer that needs it; (b) when the document 
has been loaded, the context has already been analysed, so that it 
is easy to identify the offending characters. The main disadvantage 
is that the character-by-character scanning of the document and 
the deletion/insertion operations carried out on the offending 
characters will prolong the document loading, which could be 
noticeable on large files. Also, with the incremental loader in 
place, the initial appearance of the document before the loading is 
completed would be incorrect (unless we can call this fixing method 
while doing the loading -- that should be possible, I think, if it is 
part of the BlockLayout class rather than an independent edit method).
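
The post-load pass in (2) could look roughly like the following 
Python sketch. It assumes the document is available as a list of 
(text, is_rtl) runs whose direction has already been resolved; that 
representation, the MIRROR table (only a few pairs shown) and the 
name fix_mirrored are all hypothetical, not AW's actual API.

```python
# Hypothetical sketch of the post-load fix-up; not AW code.
MIRROR = {"(": ")", ")": "(", "[": "]", "]": "[", "{": "}", "}": "{"}
_SWAP = str.maketrans(MIRROR)

def fix_mirrored(runs):
    """Swap mirror characters in RTL runs.

    runs is a list of (text, is_rtl) tuples. Returns the corrected
    list and the number of characters that were replaced.
    """
    fixed_runs = []
    changed = 0
    for text, is_rtl in runs:
        if is_rtl:
            swapped = text.translate(_SWAP)
            # Each mapping is one character to one character, so zip lines up.
            changed += sum(a != b for a, b in zip(text, swapped))
            text = swapped
        fixed_runs.append((text, is_rtl))
    return fixed_runs, changed

# One LTR run and one RTL run stored with visually ordered parentheses.
doc = [("abc (def) ", False), ("\u05d0\u05d1 )\u05d2(", True)]
fixed_doc, n_changed = fix_mirrored(doc)
```

LTR runs are skipped without being scanned, and each RTL run is 
rewritten in one step instead of by per-character delete and insert. 
Whether AW's document model allows a whole-run replacement like this 
is not something the sketch establishes.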

I would appreciate some comments on this, especially if someone 
has a better idea of how to fix this.

Tomas
