On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
On 1/17/11 12:23 PM, spir wrote:
Andrei, would you have a look at Text's current state, mainly
theinterface, when you have time for that (no hurry) at
https://bitbucket.org/denispir/denispir-d/src
It is actually a bit more than just a string type considering true
characters as natural elements.
* It is a textual type providing a client interface of common text
manipulation methods similar to ones in common high-level languages.
(including the fact that a character is a singleton string)
* The repo also holds the main module (unicodedata) of Text's sister lib
(dunicode), providing access to various unicode algos and data.
(We are about to merge the 2 libs into a new repository.)

I think this is solid work that reveals good understanding of Unicode.
That being said, there are a few things I disagree about and I don't
think it can be integrated into Phobos.

We are exploring a new field. (Except for the work Objective-C designers did -- but we just discovered it.)

One thing is that it looks a lot
more like D1 code than D2. D2 code of this kind is automatically
expected to play nice with the rest of Phobos (ranges and algorithms).
As it is, the code is an island that implements its own algorithms
(mostly by equivalent handwritten code).

Right. We precisely initially wanted to let it play nicely with the rest of new Phobos. This meant mainly provide a range interface, which also gives access to std.algorithm routines. But we were blocked by current bugs related to ranges. I have posted about those issues (you may remember having replied to this post).

In detail:

* Line 130: representing a text as a dchar[][] has its advantages but
major efficiency issues. To be frank I think it's a disaster. I think a
representation building on UTF strings directly is bound to be vastly
better.

I don't understand your point. Where is the difference with D's builtin types, then?

Also, which efficiency issue do you mention? Upon text object construction, we do agree and I have given some data. But this happens only once; it is an investment intended to provide correctness first, and efficiency of _every_ operation on constructed text. Upon speed ofsuch methods / algorithms operating _correctly_ on universal text, precisely, since there is no alternative to Text (yet), there are also no available performance data to judge.

(What about comparing Objective-C's NSString to Text's current performance for indexing, slicing, searching, counting,...? Even in its current experimental stage, I bet it would not be ridiculous, rather the opposite. But I may be completely wrong.)

* 163: equality does what std.algorithm.equal does.

* 174: equality also does what std.algorithm.equal does (possibly with a
custom pred)

Right, these are unimportant tool func at the "pile" level. (Initially introduced because builtin "==" showed strange inefficency in our case. May test again later.)

* 189: TextException is unnecessary

Agreed.

* 340: Unless properly motivate, iteration with opApply is archaic and
inefficient.

See range bug evoked above. opApply is the only workaround AFAIK.
Also, ranges cannot yet provide indexed iteration like
        foreach(i, char ; text) {...}

* 370: Why lose the information that the result is in fact a single Pile?

I don't know what information loss you mean.

Generally speaking, Pile is more or less an implementation detail used to internally represent a true character; while Text is the important thing. At one time we had to chose whether make Pile an obviously exposed type as well, or not. I chose (after some exchange on the topic) not to do it for a few reasons:
* Simplicity: one type does all the job well.
* Avoid confusion due to conflict with historic string types which elements (codes=characters) were atomic thingies. This was also a reason not to name it simply "Character"; "Pile" for me was supposed to rather evoke the technical side than the meaningful side. * Lightness of the interface: if we expose Pile obviously, then we need to double all methods that may take or return a single character, like searching, counting, replacing etc... and also possibly indexing and iteration.

In fact, the resulting interface is more or less like a string type in high-level languages such as Python; with the motivating difference that it operates correctly on universal text.

Now, it seems you rather expect, maybe, the character/pile type to be the important thing and Text to just be a sequence of them? (possibly even unnecessary to be defined formally)

* 430, 456, 474: contains, indexOf, count and probably others should use
generic algorithms, not duplicate them.

* 534: replace is std.array.replace

I had to write algos because most of them in std.algorithm require a range interface, IIUC; and also for testing purpose.

* 623: copy copies the piles shallowly (not sure if that's a problem)

Had the same interrogation.

As I mentioned before - why not focus on defining a Grapheme type (what
you call Pile, but using UTF encoding) and defining a ByGrapheme range
that iterates a UTF-encoded string by grapheme?

Dunno. This simply was not my approach. Seems to me Text as is provides clients with an interface a simple and clear as possible, while operating correctly in the backgroung.

It seems if you just build a ByGrapheme iterator, then you have no other choice than abstracting on the fly (constructing piles on the fly for operations like indexing and normalising them in addition for searching, counting...). As I said in other posts, this may be the right thing to do from an efficiency point of view, but this remains to be proven. I bet the opposite, in fact, that --with same implementation language and same investment in optimisation-- the approach defining a true textual type like Text is inevitbly more efficient by orders of magnitude (*). Again, Text construction initial cost is an investment. Prove me wrong (**).

Andrei

Denis


(*) Except, probably, for the choice of making the ElemenType a singleton Text (seems costly). (**) I'm now aware of the high speed loss Text certainly suffers from representing characters as mini-arrays, but I guess it is marginally relevant compared to the gain of not piling and normalising for every operation.
_________________
vita es estrany
spir.wikidot.com

Reply via email to