On Tue, 19 May 2015 01:25:54 +0200 Philippe Verdy <verd...@wanadoo.fr> wrote:
> I don't work with strings, but with what you seem to call "traces",

For the concept of traces, Wikipedia suffices: https://fr.wikipedia.org/wiki/Mono%C3%AFde_des_traces . As far as text manipulation is concerned, the word 'trace' is an idealisation of how Latin text is written. Base letters advance the writing point, so they commute with nothing - canonical combining class 0. Ideally, marks of different canonical combining classes do not interact with one another when writing, so they commute. In general, marks of the same canonical combining class do interact with one another, if only to move the later one further from the base letter, so they do not commute.

The traces I refer to are the equivalence classes of Unicode strings modulo canonical equivalence. To apply the theory, I have to regard decomposable characters as notations for sequences of 1 to 4 indecomposable characters. The notion also works with compatibility equivalence, and one could use a still stronger notion of equivalence, so that compatibility ideographs did not have singleton decompositions. Thus, as strings, \u0323\u0302 and \u0302\u0323 are distinct, but as traces, they are identical. The lexicographic normal form that is most useful is simply NFD: the indecomposable characters are ordered by canonical combining class, and beyond that the order does not matter, so one may as well order by code point.

> but that I call sets of states (they are in fact bitsets, which may be
> compacted or just stored as arrays of bytes containing just 1 useful
> bit, but which may be a bit faster; byte arrays are just simpler to
> program), in a stack (I'll use bitsets later to make the structure
> more compact, if needed, but for now this is fast enough and not
> memory intensive even for large regexps with many repetitions with
> "+/*/{m,n}" or variable parts).

Your 'bitset' sounds like a general-purpose type, and like an implementation detail that surfaces in your discussion.
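The commutation rules above can be checked directly with Python's unicodedata module (a small illustration of my point, not code from my matcher): marks of different canonical combining classes commute under canonical equivalence, marks of the same class do not.

```python
import unicodedata

# U+0323 (dot below, ccc 220) and U+0302 (circumflex, ccc 230) have
# different canonical combining classes, so they commute: both orders
# are canonically equivalent, and NFD reorders them identically.
a = "e\u0323\u0302"
b = "e\u0302\u0323"
assert a != b  # distinct as strings
assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Two marks of the SAME class (U+0301 and U+0300, both ccc 230) do not
# commute: NFD preserves their relative order, so the traces differ.
c = "e\u0301\u0300"
d = "e\u0300\u0301"
assert unicodedata.normalize("NFD", c) != unicodedata.normalize("NFD", d)

print(unicodedata.combining("\u0323"), unicodedata.combining("\u0302"))  # 220 230
```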
> The internal matcher uses NFD, but
> needs to track the positions in the original buffered input for
> returning captured matches.

That's how I'm working. I do not regard decomposable characters as atomic; I am quite happy working with fractions of characters.

> ... Greek, Cyrillic and Arabic, but also too few for Hebrew where
> "pathological" cases of regexps are certainly more likely to occur
> than in Latin, even with Vietnamese and its frequent double
> diacritics).

I was just thinking that respecting canonical equivalence might be very useful for Hebrew, particularly when dealing with text with accents.

> Finally a question:
>
> I suppose that like many programmers you have read the famous "green
> dragon" book by Sethi/Aho/Ullman about compilers. I can
> understand the terminology they use when speaking about automata
> (and that is found in many other places), but apparently you are
> using some terms that I have to guess from their context.

No, I started off by hunting the web to try to work out what was special about a regular expression, and found the articles in Wikipedia quite helpful. When working out how to make matching respect canonical equivalence, I started out by normalising strings to NFD. Only after I had generalised the closure properties of regular languages from strings to these representative forms (with the exception of Kleene star) did I finally discover what I had long suspected: that I was not the first person to investigate regular expressions on non-free monoids. What does surprise me is that I cannot find any evidence that anyone else has made the connection between trace monoids and Unicode strings under canonical equivalence. I would like to update the article on the trace monoid with its most important example, Unicode strings under canonical equivalence, but, alas, that seems to be 'original research'!
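To illustrate the bookkeeping involved in matching over NFD while still reporting positions in the raw input, here is a hypothetical helper (`nfd_with_offsets` is my name, not yours): it tags each NFD code point with the index of the original character it came from, and carries those tags through the canonical reordering pass.

```python
import unicodedata

def nfd_with_offsets(s):
    # Pair each NFD code point with the index, in s, of the original
    # character it decomposed from, so matches found in NFD space can
    # be reported as offsets into the raw input. A sketch only.
    pairs = []
    for i, ch in enumerate(s):
        for d in unicodedata.normalize("NFD", ch):
            pairs.append((d, i))
    # Canonical reordering (UAX #15): repeatedly swap adjacent
    # non-starters whose combining classes are out of order, keeping
    # each code point's source index attached to it.
    changed = True
    while changed:
        changed = False
        for j in range(len(pairs) - 1):
            c1 = unicodedata.combining(pairs[j][0])
            c2 = unicodedata.combining(pairs[j + 1][0])
            if c1 > c2 > 0:
                pairs[j], pairs[j + 1] = pairs[j + 1], pairs[j]
                changed = True
    return pairs

pairs = nfd_with_offsets("e\u0302\u0323")  # circumflex typed before dot below
print("".join(c for c, _ in pairs))   # NFD order: e, U+0323, U+0302
print([i for _, i in pairs])          # each code point keeps its source index
```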
I'm beginning to think that 'letting the regex choose the input character' might be a better way of dealing with the interleaving of subexpressions, even for 'non-deterministic' engines, i.e. those which follow all possible paths in parallel. I'll have to compare the relevant complexities.

> Good books on the subject are now becoming difficult to find (or
> they are more expensive now), and too difficult to use on the web
> (for such very technical topics, it really helps to have a printed
> copy, that you can annotate, explore, or have beside you instead of
> on a screen (and printing ebooks is not an option if they are
> voluminous)). Maybe you have other books to recommend.

Google Books, in English, gives access to a very helpful chapter on regular languages in trace monoids in 'The Book of Traces'. I found Russ Cox's Internet notes on regular expressions helpful, though not everyone agrees with his love of non-determinism.

Richard.
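P.S. A toy backtracking sketch of 'letting the regex choose the input character' (my own illustration, under the assumption that all inputs are non-starters of one combining sequence): at each step the pattern may consume any pending mark that commutes past everything before it, i.e. any mark preceded by no unconsumed mark of the same combining class.

```python
import unicodedata

def choosable(pending):
    # Indices of pending marks the pattern may consume next: a mark
    # commutes past the marks before it only if none of them shares
    # its canonical combining class.
    seen, picks = set(), []
    for k, ch in enumerate(pending):
        ccc = unicodedata.combining(ch)
        if ccc not in seen:
            picks.append(k)
        seen.add(ccc)
    return picks

def match_marks(pattern, pending):
    # Does `pattern` (a literal sequence of marks) match `pending` as
    # a trace? Exponential in the worst case; a real engine would
    # follow the choices in parallel.
    if not pattern:
        return not pending
    for k in choosable(pending):
        if pending[k] == pattern[0] and \
                match_marks(pattern[1:], pending[:k] + pending[k + 1:]):
            return True
    return False

# circumflex (ccc 230) and dot below (ccc 220) commute:
print(match_marks("\u0302\u0323", "\u0323\u0302"))  # True
# acute and grave (both ccc 230) do not:
print(match_marks("\u0300\u0301", "\u0301\u0300"))  # False
```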