Re: Initial soft hyphen support
On Jan 15, 2007, at 21:25, J.Pietschmann wrote:

> Andreas L Delmelle wrote:
>> BTW: I took a very quick look, and does anyone know if there is a good reason why Hyphenation.word is a String?
>
> The hyphenator interface goes through several wrapping layers, probably due to the usual "take working code and wrap it to fit the caller" method.

Looks that way... Traced it down, and in TextLM.getWordChars() we get

sbChars.append(new String(textArray, ai.iStartIndex, ai.iBreakIndex - ai.iStartIndex));

Not really sure what would be most efficient:
- a void method appending to a parameter StringBuffer
- a method returning a copy of the char[] from index to index...

Seeing that every String ultimately has a backing char[](*) anyway, I'd say that we can safely return the copy, and remove the overhead of

StringBuffer.append(new String(char[])).toString().toCharArray()

Hmmm... Put it like that, and this would almost be one for the Daily WTF! 8-)

(*) which, BTW, answers the question about the number of char[] instances being twice that of the text-nodes in the document in the snapshot posted by Richard earlier on in the thread about memory issues. Sure, there are some 39K text-nodes in the document, but there are most likely at least as many non-internalized property values (cfr. the number of String instances)...

> This always seemed overly complicated to me. I tried to come up with a comprehensive API for hyphenation (which would also be applicable to spelling and other similar tasks). Unfortunately, there doesn't seem to be any usable standard; all APIs I've seen are very specific or simply horrible. Any simplification is certainly welcome.

A quick-and-dirty hack to make the Hyphenator return a Hyphenation as I described earlier on --hyph-point for the SHY and the rest as two separate hyphenated words-- doesn't seem too hard to pull off, but it would be an exception for the SHY only. For a more comprehensive approach, I currently don't know enough about hyphenation basics, I'm afraid...
Cheers, Andreas
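The two options weighed in the message above can be sketched roughly as follows. This is only an illustration: the class and method names (WordCharsSketch, appendWordChars, copyWordChars) are made up for the example and are not FOP's actual API. Both variants avoid the intermediate String.

```java
class WordCharsSketch {

    // Option 1: append the range straight into a caller-supplied buffer.
    // (Hypothetical helper, not part of FOP.)
    static void appendWordChars(StringBuffer target, char[] text, int start, int end) {
        target.append(text, start, end - start);
    }

    // Option 2: return a fresh copy of the range as char[].
    // (Hypothetical helper, not part of FOP.)
    static char[] copyWordChars(char[] text, int start, int end) {
        char[] word = new char[end - start];
        System.arraycopy(text, start, word, 0, word.length);
        return word;
    }

    public static void main(String[] args) {
        char[] text = "soft hyphen support".toCharArray();

        StringBuffer sb = new StringBuffer();
        appendWordChars(sb, text, 5, 11);
        System.out.println(sb);                                      // hyphen

        System.out.println(new String(copyWordChars(text, 5, 11))); // hyphen
    }
}
```

Option 2 hands the hyphenator a char[] directly, which is what it ends up working on anyway according to the thread.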
Re: Unicode issues
Simon Pepping wrote:

> Aren't ligatures a feature of the font,

Yes and no. While the font may provide a glyph, it is the responsibility of the content rendering code to decide whether a ligature should be used. Deciding whether a ligature is applicable is not necessarily trivial. For example, there is usually a ligature for the "ffi" character triple (especially for slanted fonts), but the word "auffinden" uses only the "fi" ligature because "auf-" is a prefix for the verb "finden". The oe ligature in Swedish (or Danish) has even weirder rules. The renderer may try to fake a ligature if the font doesn't have a glyph, or it may just ignore the issue.

> NFC is the standard for the web. Does that carry any weight?

Given that this choice is buried deep in the FOP engine, far away from any interface intended for average users, I'd say we can choose whatever we think suits best. It's only a concern for hyphenation pattern generation and the FOP dev team.

J.Pietschmann
Re: Initial soft hyphen support
Andreas L Delmelle wrote:

> BTW: I took a very quick look, and does anyone know if there is a good reason why Hyphenation.word is a String?

The hyphenator interface goes through several wrapping layers, probably due to the usual "take working code and wrap it to fit the caller" method. This always seemed overly complicated to me. I tried to come up with a comprehensive API for hyphenation (which would also be applicable to spelling and other similar tasks). Unfortunately, there doesn't seem to be any usable standard; all APIs I've seen are very specific or simply horrible. Any simplification is certainly welcome.

J.Pietschmann
Re: Unicode issues
On Mon, Jan 15, 2007 at 04:42:12PM +0100, J.Pietschmann wrote:

> As for Ligatures and character shaping: an algorithm for automatically
> detecting ligature points may use a pattern lookup similar to the
> pattern based hyphenation. The pattern dictionary should store only
> either NFD or NFC forms, for the same reason this is advisable for
> hyphenation.

Aren't ligatures a feature of the font, e.g. the GSUB table of an OpenType font? That is, one font may have a specific ligature, while another font does not.

> We should choose either NFD or NFC as a canonical representation for
> hyphenation patterns (and, in the future, for similar things), so that
> hyphenation patterns containing umlauts can be found regardless of
> the representation of the umlaut in the source file. Currently, we
> don't care much, which works but may break suddenly.
> There is obviously a slight space vs. run time tradeoff (NFC ought to
> be more compact but NFC'ing the source text may be more expensive
> than NFD'ing).

NFC is the standard for the web. Does that carry any weight?

Simon

--
Simon Pepping
home page: http://www.leverkruid.eu
Re: Unicode issues
On Sun, Jan 14, 2007 at 11:05:33PM +0100, J.Pietschmann wrote:

> There are libraries which already implement UAX#15 properly, e.g. icu4j,
> but especially icu4j is a rather large blob of a jar. I think Unicode
> normalization should be handled like PDF encryption: do it if the
> library is available, otherwise emit a warning and simply skip the
> step.

That is a good idea. I think we should regard ICU4J as a base library and not implement anything which is available in it. We may expect people who want decent Unicode functionality to have the ICU libraries installed.

Simon

--
Simon Pepping
home page: http://www.leverkruid.eu
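The "do it if the library is available, otherwise warn and skip" idea could look roughly like the sketch below. Everything here is an assumption made for illustration: the class OptionalStepSketch and its methods are invented, and the JDK's java.text.Normalizer merely stands in for whatever call into the optional library would be used in practice. Only the class name com.ibm.icu.text.Normalizer refers to ICU4J itself, and it is used solely for the presence check.

```java
import java.text.Normalizer;
import java.util.function.UnaryOperator;

class OptionalStepSketch {

    // True if the named class can be loaded from the current classpath.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Runs the step only if the required class is present;
    // otherwise warns and returns the text unchanged.
    static String applyIfAvailable(String text, String requiredClass,
                                   UnaryOperator<String> step) {
        if (isAvailable(requiredClass)) {
            return step.apply(text);
        }
        System.err.println("WARNING: " + requiredClass
                + " not found, skipping normalization");
        return text;
    }

    public static void main(String[] args) {
        // Stand-in step: the JDK normalizer, used here only so the
        // sketch runs without ICU4J on the classpath.
        UnaryOperator<String> nfc = s -> Normalizer.normalize(s, Normalizer.Form.NFC);
        String result = applyIfAvailable("a\u0308", "com.ibm.icu.text.Normalizer", nfc);
        System.out.println(result.length());
    }
}
```

The printed length depends on whether ICU4J happens to be on the classpath: without it, the text passes through untouched and a warning goes to stderr.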
Re: Initial soft hyphen support
On Jan 14, 2007, at 23:11, J.Pietschmann wrote:

> Andreas L Delmelle wrote:
>> The SHY character will be presented to the hyphenator simply as a character of the word it appears in. The hyphenator should then be smart enough to recognize this as a special character, and do something like: create a hyphenation point for the SHY, ...
>
> Unfortunately, the hyphenator currently isn't nearly as smart, and it's a major job to push it in this direction. E.g. it means major API changes.

Unfortunate indeed :(

BTW: I took a very quick look, and does anyone know if there is a good reason why Hyphenation.word is a String? I mean, everything that comes from FOText and passes through TextLM is already char[]. The Hyphenation constructor takes a String parameter, so I guess somewhere --haven't looked yet-- a String is constructed from the portion of char[] that is to be hyphenated. If you then look at HyphenationTree, it says word.toCharArray()...

Cheers,

Andreas
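As a rough illustration of the SHY handling discussed above (treat each soft hyphen as a hyphenation point and hand the pieces on either side to the hyphenator as separate words), a splitting step might look like this. The class SoftHyphenSketch and its method are hypothetical and say nothing about how FOP's hyphenator is actually structured.

```java
import java.util.ArrayList;
import java.util.List;

class SoftHyphenSketch {

    static final char SHY = '\u00AD';

    // Splits a word at every soft hyphen. Each boundary between two
    // returned parts is a hyphenation point; each part could then be
    // passed to a pattern-based hyphenator on its own.
    static List<String> splitAtSoftHyphens(char[] word) {
        List<String> parts = new ArrayList<String>();
        int start = 0;
        for (int i = 0; i < word.length; i++) {
            if (word[i] == SHY) {
                parts.add(new String(word, start, i - start));
                start = i + 1;
            }
        }
        parts.add(new String(word, start, word.length - start));
        return parts;
    }

    public static void main(String[] args) {
        char[] word = "auf\u00ADfinden".toCharArray();
        System.out.println(splitAtSoftHyphens(word)); // [auf, finden]
    }
}
```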
Re: FOP Memory issues (fwd from fop-users)
On Jan 14, 2007, at 22:51, J.Pietschmann wrote:

> Andreas L Delmelle wrote:
>> Based on Jörg's statistics, I'd say that the number of children will most likely never reach the level where using direct index-based access (ArrayList) has its benefits over traversing a tree of references (LinkedList).
>
> There may be FOs, specifically fo:flow and fo:table-body, which may have hundreds of children in real documents. If the FOs use access functions for the children, even for adding, each FO can implement a mechanism which suits its purposes best. In particular, page masters and tables can store the regions in typed fields, FOs which don't have children can get rid of the field completely.

This happens already for some FOs. Tables, for instance, override addChildNode() and store only bodies as their 'child nodes'. The header and footer have separate typed fields, the columns end up in a separate standard List. Behavior for markers is another example, they too end up in a separate collection (map).

I've already been experimenting locally with removing the List member, and replacing it with pointers to the neighbors. As I re-implemented FObj.addChildNode(), it occurred to me that FOs with a possibly larger amount of children would need to traverse the entire list to append child nodes to the end. For a handful of children, this costs virtually nothing, but as the list gets bigger... Could easily be solved though for the applicable FOs, by keeping a reference to the last added child, and giving it an addSibling() method.

> I'm not sure whether the layout engine places any restrictions on the access to FO children. Is it possible that access is not random?

FTM, there are only a handful of places in the code where the childNodes list is accessed using an index. The only way it can be accessed by classes outside the fo package is through FObj.getChildNodes(), and that method returns a ListIterator, not the list itself.
The LMs only need access to these iterators while creating their childLMs, which is done in forward sequential order, i.e.

for (Iterator i = AbstractLM.fobjIter; i.hasNext();) { ... }

[Note: these are ArrayLists we're talking about. Iterators are far from the best approach with lists implementing RandomAccess. Not that this matters much if the lists are small...]

This fobjIter is created in the constructor of the LM. IIC, this means that there is currently no way to, for example, instantiate the FlowLM as soon as the first endBlock() event occurs, and keep extending the flow's list of child nodes in a different thread. The FlowLM's fobjIter will then throw a ConcurrentModificationException.

In the structure I'm currently playing with, it would be possible to construct an iterator over an FObj whose child nodes are not all known, move to the last node in the list, put the iterator in a stack (or keep a reference as an instance member), continue adding nodes in another thread, return, and continue iterating (a call to next() will then return the first node that was added after the iterator was paused).

That would be mighty dangerous if the access to the nodes was not so predictable, but as it stands now, this could turn out to be a benefit, if documented properly. Removal from and addition to the tail of the list are never a problem. The iterator will simply keep following the following-sibling reference until it is null.

The head of the list might become a problem if one needs a decent implementation of nextIndex() and previousIndex(). If we never really need the index, then the fact that a child was added at a previous position will simply make the iterator go back a few steps further than where it started.

Cheers,

Andreas
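A minimal sketch of the sibling-linked structure described in this message, under stated assumptions: the names Node, addChildNode and childIterator are chosen for the example and do not reflect FOP's FObj classes. Threading is left out entirely, so a real multi-threaded use would need proper synchronization. It only shows the tail reference for cheap appends and an iterator that picks up nodes appended after it reached the end.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class SiblingListSketch {

    static class Node {
        final String name;
        Node nextSibling;   // following-sibling reference
        Node firstChild;
        Node lastChild;     // tail reference, so appending needs no traversal

        Node(String name) { this.name = name; }

        void addChildNode(Node child) {
            if (lastChild == null) {
                firstChild = child;
            } else {
                lastChild.nextSibling = child;
            }
            lastChild = child;
        }

        // Iterator that follows the sibling references lazily, so nodes
        // appended after it was created (or paused) are still visited.
        Iterator<Node> childIterator() {
            final Node parent = this;
            return new Iterator<Node>() {
                private Node current; // last node returned, null before the first

                private Node peek() {
                    return current == null ? parent.firstChild : current.nextSibling;
                }
                public boolean hasNext() { return peek() != null; }
                public Node next() {
                    Node n = peek();
                    if (n == null) throw new NoSuchElementException();
                    current = n;
                    return n;
                }
                public void remove() { throw new UnsupportedOperationException(); }
            };
        }
    }

    public static void main(String[] args) {
        Node flow = new Node("flow");
        flow.addChildNode(new Node("block1"));
        Iterator<Node> it = flow.childIterator();
        System.out.println(it.next().name);  // block1
        System.out.println(it.hasNext());    // false
        flow.addChildNode(new Node("block2"));
        System.out.println(it.next().name);  // block2
    }
}
```

Unlike an ArrayList iterator, this one does not fail on modification; appending at the tail while iteration is paused simply extends what next() will return.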
Re: Unicode issues
Manuel Mall wrote:

>> Font selection in combination with character substitution. Ligatures and character shaping.
>
> Joerg, can you elaborate on this for me please.

Fonts may contain glyphs for precomposed Unicode characters, or they may not. If a list of fonts is searched for a glyph of a character, it may be useful to look for
- glyphs for the encoded value (which needs the "Grapheme Cluster Boundaries" stuff from UAX#29)
- glyphs for the fully decomposed form (UAX#15 NFD)
- glyphs for maximal composition (UAX#15 NFC)

As for Ligatures and character shaping: an algorithm for automatically detecting ligature points may use a pattern lookup similar to the pattern based hyphenation. The pattern dictionary should store only either NFD or NFC forms, for the same reason this is advisable for hyphenation.

> In Unicode an 'umlaut' can be represented as 1 or 2 codepoints. What, in your opinion, should FOP do with a codepoint which can be split into two, or vice versa?

We should choose either NFD or NFC as a canonical representation for hyphenation patterns (and, in the future, for similar things), so that hyphenation patterns containing umlauts can be found regardless of the representation of the umlaut in the source file. Currently, we don't care much, which works but may break suddenly. There is obviously a slight space vs. run time tradeoff (NFC ought to be more compact but NFC'ing the source text may be more expensive than NFD'ing).

> I noticed that PDF prints a # for a word joiner for example.

Ouch!

> That's why I thought that most Cf code points should be dealt with in layout and not be passed to the renderers.

It depends on the features of the target format. After all, PDF viewers do kerning and some paragraph typesetting (e.g. line centering) by themselves if properly instructed. The SVG flow text also has some "somewhat higher level" functionality, which users might prefer to be used. Unfortunately, all this has potential to complicate the FOP layout.

J.Pietschmann
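To make the umlaut point concrete, here is a small sketch using the JDK's java.text.Normalizer (available since Java 6). It is not what FOP does; it only shows why a pattern lookup can miss when source text and patterns use different representations, and how normalizing both sides to one form avoids that. The class name and helper methods are invented for the example.

```java
import java.text.Normalizer;

class UmlautNormalizationSketch {

    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E4";   // a-umlaut as one code point (U+00E4)
        String decomposed = "a\u0308";   // 'a' followed by combining diaeresis (U+0308)

        // The raw strings differ, so a naive dictionary lookup would miss.
        System.out.println(precomposed.equals(decomposed));               // false

        // After normalizing both to the same form they compare equal.
        System.out.println(toNfc(precomposed).equals(toNfc(decomposed))); // true
        System.out.println(toNfd(precomposed).equals(toNfd(decomposed))); // true

        // NFC is the more compact form for this character.
        System.out.println(toNfc(decomposed).length());                   // 1
        System.out.println(toNfd(precomposed).length());                  // 2
    }
}
```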
Re: Kerning and hyphenation
On 13.01.2007 06:32:15 Manuel Mall wrote:

> While looking into the support for the Soft Hyphen the area of
> hyphenation combined with kerning is causing me trouble.
>
> Aside 1: our fonts page claims we don't support kerning but I believe we
> do.

Yes, we do.

> Aside 2: Our tests are based on setting the base14kerning flag. I don't
> think this flag is documented anywhere on the fop site. Can it be set
> in the config file?

Not yet, no, only through the FopFactory.

> I wrote a simple test under fop 0.93 and noticed that the ipd of text
> areas with the combination of kerning and hyphenation are not correctly
> calculated (at least I think so). With the Base 14 sans-serif font at
> 20pt we get:
>
> Word "vanda" alone on a line: ipd=53980
> Word "vandavanda" alone on a line: ipd=107560 (53980+53980-400)
> Word "vandavandavanda" alone on a line: ipd=161140
> (53980+53980+53980-400-400)
> If we now take the word "vandavandavandavanda" and get it hyphenated
> after the 3rd "vanda" the line containing "vandavandavanda-" reports an
> ipd of 168800 which is incorrect. I think it should be 161140 + 6660
> (width of hyphen) = 167800.

I think you mean it is 168600 (that's what I get) and it should be 167800, which is a difference of 800mpt. 800mpt looks much like two times 400mpt kerning which are not removed from the area object's IPD for the break at a hyphenation point. However, I'd expect a difference of only 400mpt, not 800mpt. I did fix [1] the behaviour for that condition so the layout looks correct, but apparently I didn't adjust the IPD for the area object.

[1] http://svn.apache.org/viewvc?rev=407774&view=rev

> Before I investigate this more can someone else check this and see if
> they get similar results.
>
> I have attached the test case I have constructed.
>
> Manuel

Jeremias Maerki
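The arithmetic in this exchange can be checked with a few lines of code. This is just a sketch that reproduces the figures quoted above (all values in millipoints). The helper name expectedIpd is invented, and it assumes, as the quoted calculation does, that kerning applies only between adjacent "vanda" repeats and not before the hyphen.

```java
class KerningIpdSketch {

    // Expected inline-progression dimension for `repeats` copies of a word,
    // with one kerning adjustment at each internal word join, plus an
    // optional trailing hyphen. All values in millipoints (mpt).
    static int expectedIpd(int wordWidth, int repeats, int kerning, int hyphenWidth) {
        return repeats * wordWidth + (repeats - 1) * kerning + hyphenWidth;
    }

    public static void main(String[] args) {
        int vanda = 53980;   // width of "vanda" from the quoted test
        int kern = -400;     // kerning at each "a" + "v" join
        int hyphen = 6660;   // width of the hyphen glyph

        System.out.println(expectedIpd(vanda, 2, kern, 0));      // 107560
        System.out.println(expectedIpd(vanda, 3, kern, 0));      // 161140
        int expected = expectedIpd(vanda, 3, kern, hyphen);
        System.out.println(expected);                            // 167800

        int reported = 168600;  // value Jeremias observed
        System.out.println(reported - expected);                 // 800
    }
}
```

The 800mpt surplus equals exactly two kerning adjustments, which matches the suspicion that the kerning is not subtracted from the area's IPD when the break falls at a hyphenation point.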