Re: Initial soft hyphen support

2007-01-15 Thread Andreas L Delmelle

On Jan 15, 2007, at 21:25, J.Pietschmann wrote:


Andreas L Delmelle wrote:
BTW: I took a very quick look, and does anyone know if there is a  
good reason why Hyphenation.word is a String?


The hyphenator interface goes through several wrapping layers,
probably due to the usual "take working code and wrap it to fit
the caller" method.


Looks that way...
Traced it down, and in TextLM.getWordChars() we get

  sbChars.append(new String(textArray, ai.iStartIndex,
  ai.iBreakIndex - ai.iStartIndex));


Not really sure what would be most efficient:
- a void method appending to a parameter StringBuffer
- a method returning a copy of the char[] from index to index...

Given that every String ultimately has a backing char[](*) anyway, I'd
say that we can safely return the copy, and remove the overhead of


StringBuffer.append(new String(char[])).toString().toCharArray()

Hmmm... Put it like that, and this would almost be one for the Daily  
WTF! 8-)
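
A minimal sketch of the second option above (returning a copy of the char[] range), for illustration only. The class and method names are hypothetical and do not exist in FOP; the index arguments stand in for ai.iStartIndex and ai.iBreakIndex.

```java
final class WordChars {
    private WordChars() { }

    /**
     * Returns a copy of text[start..end) directly, without the intermediate
     * String and StringBuffer objects. Hypothetical helper, not FOP code.
     */
    static char[] copyRange(char[] text, int start, int end) {
        char[] word = new char[end - start];
        System.arraycopy(text, start, word, 0, word.length);
        return word;
    }

    public static void main(String[] args) {
        char[] text = "soft hyphen support".toCharArray();
        char[] word = copyRange(text, 5, 11);
        System.out.println(new String(word)); // prints "hyphen"
    }
}
```

Whether this is actually cheaper in the hyphenation path would need measuring; the sketch only shows that the round trip through String is not required to get the characters.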


(*) which BTW, answers the question about the char[] instances being  
twice that of the text-nodes in the document in the snapshot posted  
by Richard earlier on in the thread about memory issues. Sure, there  
are some 39K text-nodes in the document, but there are most likely at  
least as many non-interned property values (cf. the number of
String instances)...



This has always seemed overly complicated to me. I tried
to come up with a comprehensive API for hyphenation (which would
also be applicable to spelling and other similar tasks). Unfortunately,
there doesn't seem to be any usable standard, all APIs I've seen
are very specific or simply horrible. Any simplification is certainly
welcome.


A quick-and-dirty hack to make the Hyphenator return a Hyphenation as  
I described earlier on --hyph-point for the SHY and the rest as two  
separate hyphenated words-- doesn't seem too hard to pull off, but it  
would be an exception for the SHY only. For a more comprehensive  
approach, I currently don't know enough about hyphenation basics, I'm  
afraid...



Cheers,

Andreas


Re: Unicode issues

2007-01-15 Thread J.Pietschmann

Simon Pepping wrote:

Aren't ligatures a feature of the font,


Yes and no. While the font may provide a glyph, it is the
responsibility of the content rendering code to decide whether
a ligature should be used. Deciding whether a ligature is
applicable is not necessarily trivial, for example there is
usually a ligature for the "ffi" character triple (especially
for slanted fonts), but the word "auffinden" uses only the "fi"
ligature because "auf-" is a prefix for the verb "finden". The
oe ligature in Swedish (or Danish) has even weirder rules.

The renderer may try to fake a ligature if the font doesn't have
a glyph, or it may just ignore the issue.


NFC is the standard for the web. Does that carry any weight?


Given that this choice is buried deep in the FOP engine, far
away from any interface intended for average users, I'd say
we can choose whatever we think suits best. It's only a
concern for hyphenation pattern generation and the FOP dev
team.

J.Pietschmann


Re: Initial soft hyphen support

2007-01-15 Thread J.Pietschmann

Andreas L Delmelle wrote:
BTW: I took a very quick look, and does anyone know if there is a good 
reason why Hyphenation.word is a String?


The hyphenator interface goes through several wrapping layers,
probably due to the usual "take working code and wrap it to fit
the caller" method.
This has always seemed overly complicated to me. I tried
to come up with a comprehensive API for hyphenation (which would
also be applicable to spelling and other similar tasks). Unfortunately,
there doesn't seem to be any usable standard, all APIs I've seen
are very specific or simply horrible. Any simplification is certainly
welcome.

J.Pietschmann


Re: Unicode issues

2007-01-15 Thread Simon Pepping
On Mon, Jan 15, 2007 at 04:42:12PM +0100, J.Pietschmann wrote:
 
> As for Ligatures and character shaping: an algorithm for automatically
> detecting ligature points may use a pattern lookup similar to the
> pattern based hyphenation. The pattern dictionary should store only
> either NFD or NFC forms, for the same reason this is advisable for
> hyphenation.

Aren't ligatures a feature of the font, e.g. the GSUB table of an Open
Type font? That is, one font may have a specific ligature, while
another font does not.

> We should choose either NFD or NFC as a canonical representation for
> hyphenation patterns (and, in the future, for similar things), so that
> hyphenation patterns containing umlauts can be found regardless of
> the representation of the umlaut in the source file. Currently, we
> don't care much, which works but may break suddenly.
> There is obviously a slight space vs. run time tradeoff (NFC ought to
> be more compact but NFC'ing the source text may be more expensive
> than NFD'ing).

NFC is the standard for the web. Does that carry any weight?

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.eu


Re: Unicode issues

2007-01-15 Thread Simon Pepping
On Sun, Jan 14, 2007 at 11:05:33PM +0100, J.Pietschmann wrote:
> There are libraries which already implement UAX#15 properly, e.g. icu4j,
> but especially icu4j is a rather large blob of a jar. I think Unicode
> normalization should be handled like PDF encryption: do it if the
> library is available, otherwise emit a warning and simply skip the
> step.

That is a good idea. I think we should regard ICU4J as a base library
and not implement anything which is available in it. We may expect
people who want decent Unicode functionality to have the ICU libraries
installed.
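
A rough sketch of the "use it if available, otherwise warn and skip" approach discussed above. Everything here is an assumption for illustration: the TextNormalizer interface and the adapter class name are hypothetical, and no actual ICU4J calls are made. The idea is that an adapter class wrapping the optional library is loaded by name, so the core code has no compile-time dependency on it.

```java
import java.util.logging.Logger;

final class OptionalNormalization {
    private static final Logger LOG = Logger.getLogger("OptionalNormalization");

    private OptionalNormalization() { }

    /** Hypothetical adapter interface; not part of FOP or ICU4J. */
    interface TextNormalizer {
        String normalize(String text);
    }

    /**
     * Tries to load an adapter class by name. If the adapter (or the library
     * it depends on) is missing, a warning is logged and a pass-through
     * normalizer is returned, so the step is simply skipped.
     */
    static TextNormalizer load(String adapterClassName) {
        try {
            Class<?> c = Class.forName(adapterClassName);
            return (TextNormalizer) c.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
            LOG.warning("Unicode normalization unavailable (" + e + "); skipping the step");
            return text -> text;
        }
    }

    public static void main(String[] args) {
        // Hypothetical class name; it would wrap the optional library.
        TextNormalizer n = load("example.Icu4jNormalizerAdapter");
        System.out.println(n.normalize("Mu\u0308ller"));
    }
}
```

With the adapter absent, the text passes through unchanged and a single warning is logged at load time.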

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.eu


Re: Initial soft hyphen support

2007-01-15 Thread Andreas L Delmelle

On Jan 14, 2007, at 23:11, J.Pietschmann wrote:


Andreas L Delmelle wrote:
The SHY character will be presented to the hyphenator simply as a  
character of the word it appears in. The hyphenator should then be  
smart enough to recognize this as a special character, and do  
something like: create a hyphenation point for the SHY, ...


Unfortunately, the hyphenator currently isn't nearly as smart,
and it's a major job to push it in this direction. E.g. it means
major API changes.


Unfortunate indeed :(

BTW: I took a very quick look, and does anyone know if there is a  
good reason why Hyphenation.word is a String? I mean, everything that  
comes from FOText and passes through TextLM is already char[]. The  
Hyphenation constructor takes a String parameter, so I guess  
somewhere --haven't looked yet-- a String is constructed from the  
portion of char[] that is to be hyphenated. If you then look at  
HyphenationTree, it says word.toCharArray()...



Cheers,

Andreas



Re: FOP Memory issues (fwd from fop-users)

2007-01-15 Thread Andreas L Delmelle

On Jan 14, 2007, at 22:51, J.Pietschmann wrote:


Andreas L Delmelle wrote:
Based on Jörg's statistics, I'd say that the number of children
will most likely never reach the level where using direct index- 
based access (ArrayList) has its benefits over traversing a tree  
of references (LinkedList).


There may be FOs, specifically fo:flow and fo:table-body, which may
have hundreds of children in real documents.

If the FOs use access functions for the children, even for adding,
each FO can implement a mechanism which suits its purposes best. In
particular, page masters and tables can store the regions in typed
fields, FOs which don't have children can get rid of the field
completely.


This happens already for some FOs. Tables for instance, override  
addChildNode() and store only bodies as their 'child nodes'. The  
header and footer have separate typed fields, the columns end up in a  
separate standard List. Behavior for markers is another example, they  
too end up in a separate collection (map).


I've already been experimenting locally with removing the List  
member, and replacing it with pointers to the neighbors. As I re- 
implemented FObj.addChildNode(), it occurred to me that FOs with a  
possibly larger amount of children would need to traverse the entire  
list to append child nodes to the end. For a handful of children,  
this costs virtually nothing, but as the list gets bigger... Could  
easily be solved though for the applicable FOs, by keeping a  
reference to the last added child, and giving it an addSibling() method.
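
A minimal sketch of the idea in the paragraph above: children linked through sibling references, with the parent keeping a reference to the last added child so that appending does not require walking the list. The class is a hypothetical stand-in, not FObj or FONode, and the field and method names are assumptions.

```java
final class SiblingList {
    /** Hypothetical stand-in for an FO node; not the FOP FONode class. */
    static final class Node {
        final String name;
        Node nextSibling;

        Node(String name) {
            this.name = name;
        }
    }

    private Node firstChild;
    private Node lastChild; // tail reference: append without traversal
    private int count;

    /** Appends a child in constant time by linking it after the current tail. */
    void addChildNode(Node child) {
        if (firstChild == null) {
            firstChild = child;
        } else {
            lastChild.nextSibling = child;
        }
        lastChild = child;
        count++;
    }

    Node getFirstChild() {
        return firstChild;
    }

    int size() {
        return count;
    }

    public static void main(String[] args) {
        SiblingList flow = new SiblingList();
        flow.addChildNode(new Node("block-1"));
        flow.addChildNode(new Node("block-2"));
        flow.addChildNode(new Node("block-3"));
        for (Node n = flow.getFirstChild(); n != null; n = n.nextSibling) {
            System.out.println(n.name);
        }
    }
}
```

The extra tail field costs one reference per parent, so it would probably only be worth it for FOs that can have many children.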




I'm not sure whether the layout engine places any restrictions on the
access to FO children. Is it possible that access is not random?


FTM, there are only a handful of places in the code where the  
childNodes list is accessed using an index.
The only way it can be accessed by classes outside the fo package is  
through FObj.getChildNodes(), and that method returns a ListIterator,  
not the list itself. The LMs only need access to these iterators  
while creating their childLMs, which is done in forward sequential  
order, i.e.


for (Iterator i = AbstractLM.fobjIter; i.hasNext();) { ... }

[Note: these are ArrayLists we're talking about. Iterators are far  
from the best approach with lists implementing RandomAccess. Not that  
this matters much if the lists are small...]


This fobjIter is created in the constructor of the LM. IIUC, this
means that there is currently no way to, for example, instantiate the
FlowLM as soon as the first endBlock() event occurs, and keep
extending the flow's list of child nodes in a different thread. The
FlowLM's fobjIter will then throw a ConcurrentModificationException.
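
The fail-fast behaviour described above can be reproduced in a single thread with a plain ArrayList; no second thread is needed, only a structural change after the iterator was created. This is a standalone illustration with made-up names, not FOP code.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

final class FailFastDemo {
    private FailFastDemo() { }

    /** Returns true if the iterator fails after the list was appended to. */
    static boolean failsAfterAppend() {
        List<String> children = new ArrayList<>();
        children.add("block-1");
        children.add("block-2");

        Iterator<String> it = children.iterator(); // created up front, like fobjIter
        it.next();

        children.add("block-3"); // list grows after the iterator was created

        try {
            it.next(); // the iterator detects the modification here
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsAfterAppend()); // prints "true"
    }
}
```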


In the structure I'm currently playing with, it would be possible to  
construct an iterator over an FObj whose child nodes are not all  
known, move to the last node in the list, put the iterator in a stack  
(or keep a reference as an instance member), continue adding nodes in  
another thread, return, and continue iterating (a call to next() will  
then return the first node that was added after the iterator was  
paused).


That would be mighty dangerous if the access to the nodes was not so  
predictable, but as it stands now, this could turn out to be a  
benefit, if documented properly.
Removal from and addition to the tail of the list are never a  
problem. The iterator will simply keep following the following- 
sibling reference until it is null.
The head of the list might become a problem if one needs a decent  
implementation of nextIndex() and previousIndex(). If we never really  
need the index, then the fact that a child was added at a previous  
position will simply make the iterator go back a few steps further  
than where it started.
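
A sketch of the kind of iterator described above, under stated assumptions: a singly linked chain with a sentinel head, where the iterator only follows the next reference and therefore picks up nodes appended after it reached the end. The class is hypothetical, and the example is single-threaded; using it across threads would additionally need proper synchronization or volatile links, which are left out here.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

final class ResumableChildren implements Iterable<String> {
    private static final class Node {
        final String value;
        Node next;

        Node(String value) {
            this.value = value;
        }
    }

    private final Node head = new Node(null); // sentinel before the first child
    private Node tail = head;

    /** Appends at the tail; existing iterators stay valid. */
    void add(String value) {
        Node n = new Node(value);
        tail.next = n;
        tail = n;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private Node current = head; // last node returned, or the sentinel

            @Override
            public boolean hasNext() {
                return current.next != null;
            }

            @Override
            public String next() {
                if (current.next == null) {
                    throw new NoSuchElementException();
                }
                current = current.next;
                return current.value;
            }
        };
    }

    public static void main(String[] args) {
        ResumableChildren flow = new ResumableChildren();
        flow.add("block-1");
        Iterator<String> it = flow.iterator();
        System.out.println(it.next());    // block-1
        System.out.println(it.hasNext()); // false: end reached for now
        flow.add("block-2");              // appended while the iterator is "paused"
        System.out.println(it.next());    // block-2
    }
}
```

No index is tracked, so nextIndex() and previousIndex() are deliberately not offered, which sidesteps the head-of-list problem mentioned above.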




Cheers,

Andreas





Re: Unicode issues

2007-01-15 Thread J.Pietschmann

Manuel Mall wrote:

Font selection in combination with character substitution. Ligatures
and character shaping.


Joerg, can you elaborate on this for me please?


Fonts may contain glyphs for precomposed Unicode characters, or they
may not. If a list of fonts is searched for a glyph of a character,
it may be useful to look for
- glyphs for the encoded value (which needs the "Grapheme Cluster
  Boundaries" stuff from UAX#29)
- glyphs for the fully decomposed form (UAX#15 NFD)
- glyphs for maximal composition (UAX#15 NFC)

As for Ligatures and character shaping: an algorithm for automatically
detecting ligature points may use a pattern lookup similar to the
pattern based hyphenation. The pattern dictionary should store only
either NFD or NFC forms, for the same reason this is advisable for
hyphenation.

In Unicode an 'umlaut' can be
represented as 1 or 2 codepoints. What in your opinion should FOP do:
split a codepoint into two, or vice versa?


We should choose either NFD or NFC as a canonical representation for
hyphenation patterns (and, in the future, for similar things), so that
hyphenation patterns containing umlauts can be found regardless of
the representation of the umlaut in the source file. Currently, we
don't care much, which works but may break suddenly.
There is obviously a slight space vs. run time tradeoff (NFC ought to
be more compact but NFC'ing the source text may be more expensive
than NFD'ing).
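
To illustrate why one canonical form matters for pattern lookup, here is a small sketch using java.text.Normalizer, which assumes Java 6 or later (ICU4J offers comparable functionality). The class and method names are made up for the example; NFC is chosen arbitrarily, and NFD would serve equally well as long as patterns and source text use the same form.

```java
import java.text.Normalizer;

final class PatternKey {
    private PatternKey() { }

    /** Maps text to one canonical form so both umlaut encodings give the same key. */
    static String canonical(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "f\u00fcr"; // u-umlaut as a single code point
        String decomposed = "fu\u0308r"; // u followed by a combining diaeresis

        // Without normalization the two spellings do not match.
        System.out.println(precomposed.equals(decomposed)); // false

        // After normalization they do.
        System.out.println(canonical(precomposed).equals(canonical(decomposed))); // true
    }
}
```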


I noticed that PDF prints a # for a word joiner for example.


Ouch!

That's why I 
thought that most Cf code points should be dealt with in layout and not 
be passed to the renderers.


It depends on the features of the target format. After all, PDF viewers
do kerning and some paragraph typesetting (e.g. line centering) by
themselves if properly instructed. The SVG flow text also has some
"somewhat higher level" functionality, which users might prefer to be
used. Unfortunately, all this has potential to complicate the FOP
layout.

J.Pietschmann


Re: Kerning and hyphenation

2007-01-15 Thread Jeremias Maerki

On 13.01.2007 06:32:15 Manuel Mall wrote:
> While looking into the support for the Soft Hyphen the area of 
> hyphenation combined with kerning is causing me trouble.
> 
> Aside 1: our fonts page claims we don't support kerning but I believe we 
> do.

Yes, we do.

> Aside 2: Our tests are based on setting the base14kerning flag. I don't 
> think this flag is documented anywhere on the fop site. Can it be set 
> in the config file?

Not yet, no, only through the FopFactory.

> I wrote a simple test under fop 0.93 and noticed that the ipd of text 
> areas with the combination of kerning and hyphenation are not correctly 
> calculated (at least I think so). With the Base 14 sans-serif font at 
> 20pt we get:
> 
> Word "vanda" alone on a line: ipd=53980
> Word "vandavanda" alone on a line: ipd=107560 (53980+53980-400)
> Word "vandavandavanda" alone on a line: ipd=161140 
> (53980+53980+53980-400-400)
> If we now take the word "vandavandavandavanda" and get it hyphenated 
> after the 3rd "vanda" the line containing "vandavandavanda-" reports an 
> ipd of 168800 which is incorrect. I think it should be 161140 + 6660 
> (width of hyphen) = 167800.

I think you mean it is 168600 (that's what I get) and it should be
167800, which is a difference of 800mpt. 800mpt looks much like two
times 400mpt kerning which are not removed from the area object's IPD
for the break at a hyphenation point. However, I'd expect a difference
of only 400mpt, not 800mpt. I did fix [1] the behaviour for that
condition so the layout looks correct, but apparently I didn't adjust
the IPD for the area object.

[1] http://svn.apache.org/viewvc?rev=407774&view=rev
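
The arithmetic behind the numbers above, written out as a small sketch. The values (53980, 400, 6660, 168600, all in mpt) are taken from this thread; the class and method names are made up, and the model (one 400mpt kerning adjustment per join between two "vanda" units) is inferred from the quoted figures.

```java
final class KerningIpd {
    private KerningIpd() { }

    /**
     * Width of a word built from `repeats` copies of one unit, with one
     * kerning adjustment subtracted at each join between two units.
     */
    static int wordIpd(int unitWidth, int repeats, int kerningPerJoin) {
        return unitWidth * repeats - kerningPerJoin * (repeats - 1);
    }

    public static void main(String[] args) {
        int vanda = 53980; // "vanda" alone on a line
        int kern = 400;    // kerning at each join
        int hyphen = 6660; // width of the hyphen

        int expected = wordIpd(vanda, 3, kern) + hyphen;

        System.out.println(wordIpd(vanda, 2, kern)); // 107560
        System.out.println(wordIpd(vanda, 3, kern)); // 161140
        System.out.println(expected);                // 167800
        System.out.println(168600 - expected);       // 800
    }
}
```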

> Before I investigate this more can someone else check this and see if 
> they get similar results.
> 
> I have attached the test case I have constructed.
> 
> Manuel
> 
> 
> 



Jeremias Maerki