Re: zero width space

2005-11-04 Thread J.Pietschmann

Manuel Mall wrote:

What about character composition/decomposition?


Good question. Where is the answer?


Let's clarify the problem first. Suppose the input contains
the sequence U+0061 U+0308 (Latin small a, combining diaeresis),
and the font has a glyph for U+00E4 but not U+0308. Obviously,
putting the precomposed character U+00E4 into the output is
a smart move. Where should this transformation occur: output
generation, renderer, layout stage? A slight problem is that
the width of U+00E4 may be different from U+0061.
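
To make the case concrete, here is a minimal sketch of the composition step
using Python's standard unicodedata module. The font_has_glyph callback is a
hypothetical stand-in for a font lookup, not an existing FOP API, and the
sketch does not answer the question of which stage should own the step.

```python
import unicodedata

def compose_if_needed(text, font_has_glyph):
    """Return text as-is if the font covers it, else try its NFC form.

    font_has_glyph is a hypothetical callback: code point -> bool.
    """
    if all(font_has_glyph(ch) for ch in text):
        return text
    candidate = unicodedata.normalize("NFC", text)
    # Only substitute when the composed form is fully covered by the font.
    if all(font_has_glyph(ch) for ch in candidate):
        return candidate
    return text

# Toy font: has glyphs for "a" and U+00E4, but not for U+0308.
toy_font = {"a", "\u00e4"}
result = compose_if_needed("\u0061\u0308", lambda ch: ch in toy_font)
```

Because the substituted glyph may have a different advance width, the
substitution would have to happen before widths are measured.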

J.Pietschmann


Re: Leading/trailing space removal in LineLM

2005-11-04 Thread J.Pietschmann

Luca Furini wrote:
note that a word with a soft hyphen in its middle would not be 
hyphenated, unless we ignore this character when collecting word fragments


Well, in order to prepare for hyphenation, other characters
like joiners have to be removed too. We should probably also
use Unicode normalization.
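
A rough sketch of such a preparation step, assuming the word is stripped and
normalized before it is handed to the hyphenator. The set of ignorable
characters below is an assumption chosen for illustration, not a list taken
from FOP or from the Unicode standard.

```python
import unicodedata

# Assumed set of characters to drop before hyphenation pattern lookup:
# soft hyphen, zero width non-joiner, zero width joiner, word joiner, zwnbsp.
IGNORABLE = {"\u00ad", "\u200c", "\u200d", "\u2060", "\ufeff"}

def hyphenation_key(word):
    """Strip ignorable characters, then apply NFC normalization."""
    stripped = "".join(ch for ch in word if ch not in IGNORABLE)
    return unicodedata.normalize("NFC", stripped)
```

A real implementation would also need to map hyphenation points found in the
cleaned word back to positions in the original text.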

J.Pietschmann




Re: zero width space

2005-11-04 Thread The Web Maestro

On Nov 3, 2005, at 9:31 PM, Manuel Mall wrote:

Thanks a lot Peter. Seems like the Unicode consortium did change their
mind on U+200B between version 3.0 and 4.0. For the purpose of the
current version of FOP which does not (yet) recognise scripts nor
allows customisation of such behaviours we will then stick with ZWS not
affecting justification I assume?

Manuel


Just for clarity (and for the archives), does this mean FOP will be 
supporting Unicode version 3.0 or 4.0?


Regards,

Web Maestro Clay
--
My religion is simple. My religion is kindness.
- HH The 14th Dalai Lama of Tibet



Re: Leading/trailing space removal in LineLM

2005-11-04 Thread Luca Furini

Manuel Mall wrote:


Here are some of the combinations I have identified:

1. Non breaking / non elastic space => probably just a normal character, 
i.e. part of a word.


2. Non breaking / elastic space - eg. U+00A0 Non breaking space
=> Must prevent break
=> Must handle text-align

3. Break / non elastic - eg. U+200B ZWSP, any other break between two 
characters not involving adding or removing space/characters

=> Must handle border/padding
=> Must handle text-align

4. Break / non elastic / remove if not break - eg. U+00AD Soft hyphen
=> Must remove if not at break
=> Must handle border/padding
=> Must handle text-align

5. Break / non elastic / add character if break - eg. hyphenation
=> Must add space for hyphen if at break
=> Must handle border/padding
=> Must handle text-align

6. Breaking / elastic / non removable - eg. U+3000 Ideographic space
=> Must handle border/padding
=> Must handle text-align
Question: XSL-FO does not define U+3000 as removable white space, but 
would it be removed at a line break under common CJK typesetting 
conventions?


7. Breaking / elastic / removable - eg. U+0020 Space
=> Can occur in runs which must be wholly removed
=> Must handle border/padding
=> Must handle text-align

Any combinations I have missed, e.g. is there a "break / non elastic / 
remove at break" case?


Maybe the fixed width spaces?

Anyway, it seems an exhaustive analysis of the problem!
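
The combinations listed above could be captured in a small lookup table.
This is only a sketch: the SpaceClass type and its field names are
hypothetical, and case 5 (hyphenation) is left out because it is not tied to
a character in the input.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpaceClass:
    breaking: bool                    # a line break is allowed here
    elastic: bool                     # may stretch/shrink for text-align
    removable: bool                   # removable white space (case 7)
    remove_if_no_break: bool = False  # dropped unless the break is taken

CLASSES = {
    "\ufeff": SpaceClass(False, False, False),  # case 1, zwnbsp
    "\u00a0": SpaceClass(False, True, False),   # case 2, non breaking space
    "\u200b": SpaceClass(True, False, False),   # case 3, zero width space
    "\u00ad": SpaceClass(True, False, False, remove_if_no_break=True),  # case 4
    "\u3000": SpaceClass(True, True, False),    # case 6, ideographic space
    "\u0020": SpaceClass(True, True, True),     # case 7, space
}

def classify(ch):
    """Return the SpaceClass for ch, or None for ordinary characters."""
    return CLASSES.get(ch)
```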

Just a few comments / thoughts:

- non breaking, non elastic: the simple solution would be to handle these 
characters as normal "letters", so the text "before_after" (where _ is 
zwnbsp) would create a single AreaInfo object in the TextLM; but this 
would create problems during hyphenation, as non-letter characters in the 
middle of a word currently prevent hyphenation


- soft hyphen: at the moment it is not properly handled, but it won't be 
difficult to fix the implementation; it could create the same elements 
used for a hyphenation point, but the penalty could have a negative value 
(as probably users would use it to "suggest" a desired line break); note 
that a word with a soft hyphen in its middle would not be hyphenated, 
unless we ignore this character when collecting word fragments
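
A rough sketch of the soft hyphen idea, assuming a box/penalty element model.
The Box and Penalty classes, the hyphen width and the penalty value of -10 are
all made up for illustration; they are not the classes or values used in the
TextLM.

```python
from dataclasses import dataclass

SOFT_HYPHEN = "\u00ad"

@dataclass
class Box:
    text: str

@dataclass
class Penalty:
    width: int            # width of the hyphen added if the break is taken
    value: int            # cost of breaking here; negative encourages it
    flagged: bool = True  # marks a hyphenated break

def soft_hyphen_elements(word, hyphen_width=1, penalty_value=-10):
    """Split word at soft hyphens into boxes separated by penalties."""
    elements = []
    fragments = word.split(SOFT_HYPHEN)
    for i, fragment in enumerate(fragments):
        if fragment:
            elements.append(Box(fragment))
        if i < len(fragments) - 1:
            elements.append(Penalty(hyphen_width, penalty_value))
    return elements

elements = soft_hyphen_elements("hy\u00adphen")
```

The soft hyphen itself never ends up in a box, so it disappears when the break
is not taken, and the negative penalty value reflects that the user suggested
the break.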


Regards
Luca