Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

spir Mon, 17 Jan 2011 15:14:16 -0800

On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:

On 1/17/11 12:23 PM, spir wrote:

Andrei, would you have a look at Text's current state, mainly
the interface, when you have time for that (no hurry) at
https://bitbucket.org/denispir/denispir-d/src
It is actually a bit more than just a string type considering true
characters as natural elements.
* It is a textual type providing a client interface of common text
manipulation methods similar to ones in common high-level languages.
(including the fact that a character is a singleton string)
* The repo also holds the main module (unicodedata) of Text's sister lib
(dunicode), providing access to various unicode algos and data.
(We are about to merge the 2 libs into a new repository.)


I think this is solid work that reveals good understanding of Unicode.
That being said, there are a few things I disagree about and I don't
think it can be integrated into Phobos.

We are exploring a new field. (Except for the work Objective-C designers did -- but we just discovered it.)

One thing is that it looks a lot
more like D1 code than D2. D2 code of this kind is automatically
expected to play nice with the rest of Phobos (ranges and algorithms).
As it is, the code is an island that implements its own algorithms
(mostly by equivalent handwritten code).

Right. Initially, we wanted precisely to let it play nicely with the rest of the new Phobos. This mainly meant providing a range interface, which also gives access to std.algorithm routines. But we were blocked by current bugs related to ranges. I have posted about those issues (you may remember having replied to this post).

In detail:

* Line 130: representing a text as a dchar[][] has its advantages but
major efficiency issues. To be frank I think it's a disaster. I think a
representation building on UTF strings directly is bound to be vastly
better.

I don't understand your point. Where is the difference with D's builtin types, then?

Also, which efficiency issue do you mean? On the cost of constructing a text object, we do agree, and I have given some data. But this happens only once; it is an investment intended to provide correctness first, and then efficiency of _every_ operation on the constructed text. As for the speed of such methods and algorithms operating _correctly_ on universal text: precisely because there is no alternative to Text (yet), there are no performance data available to judge by.

(What about comparing Objective-C's NSString to Text's current performance for indexing, slicing, searching, counting, ...? Even in its current experimental stage, I bet it would not be ridiculous, rather the opposite. But I may be completely wrong.)

* 163: equality does what std.algorithm.equal does.

* 174: equality also does what std.algorithm.equal does (possibly with a
custom pred)

Right, these are unimportant tool functions at the "pile" level. (Initially introduced because builtin "==" showed strange inefficiency in our case. May test again later.)
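
As a minimal sketch of the point about equality: in present-day Phobos, comparing two sequences of piles needs no handwritten loop. The alias `Pile` and both function names below are hypothetical stand-ins, not names taken from the Text source; the case-insensitive predicate is only one example of a custom pred.

```d
import std.algorithm.comparison : equal;
import std.uni : icmp;

/// Hypothetical stand-in for a pile: the code points of one character.
alias Pile = dchar[];

/// Element-wise equality of two pile sequences.
bool samePiles(Pile[] a, Pile[] b)
{
    return equal(a, b);
}

/// Same comparison with a custom predicate: piles match ignoring case.
bool samePilesIgnoringCase(Pile[] a, Pile[] b)
{
    return equal!((x, y) => icmp(x, y) == 0)(a, b);
}

unittest
{
    Pile[] lower = ["a"d.dup, "e\u0301"d.dup];
    Pile[] upper = ["A"d.dup, "E\u0301"d.dup];
    assert(samePiles(lower, lower));
    assert(!samePiles(lower, upper));
    assert(samePilesIgnoringCase(lower, upper));
}
```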

* 189: TextException is unnecessary


Agreed.

* 340: Unless properly motivated, iteration with opApply is archaic and
inefficient.


See the range bug mentioned above. opApply is the only workaround AFAIK.
Also, ranges cannot yet provide indexed iteration like
        foreach (i, ch; text) {...}
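
For illustration, an opApply overload that supports indexed foreach could look roughly like the sketch below. The struct is a hypothetical, stripped-down stand-in for Text (it yields raw piles, whereas the thread says the real element type is a singleton Text), so it shows only the mechanism.

```d
/// Hypothetical, stripped-down stand-in for Text: a sequence of piles.
struct Text
{
    dchar[][] piles;

    /// Called by `foreach (i, pile; text)`. A nonzero result from the
    /// loop-body delegate signals `break` or `return` and is propagated.
    int opApply(scope int delegate(size_t index, ref dchar[] pile) dg)
    {
        foreach (i, ref pile; piles)
        {
            if (auto result = dg(i, pile))
                return result;
        }
        return 0;
    }
}

unittest
{
    auto text = Text(["n"d.dup, "o"d.dup, "e\u0308"d.dup, "l"d.dup]);
    size_t[] seen;
    foreach (i, pile; text)
        seen ~= i;
    assert(seen.length == 4 && seen[2] == 2);
}
```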

* 370: Why lose the information that the result is in fact a single Pile?


I don't know what information loss you mean.

Generally speaking, Pile is more or less an implementation detail used to internally represent a true character, while Text is the important thing. At one time we had to choose whether or not to make Pile an obviously exposed type as well. I chose (after some exchange on the topic) not to do it, for a few reasons:

* Simplicity: one type does all the job well.

* Avoid confusion due to conflict with historic string types whose elements (codes=characters) were atomic thingies. This was also a reason not to name it simply "Character"; "Pile" for me was supposed to evoke the technical side rather than the meaningful side.

* Lightness of the interface: if we expose Pile obviously, then we need to double all methods that may take or return a single character, like searching, counting, replacing etc... and also possibly indexing and iteration.

In fact, the resulting interface is more or less like a string type in high-level languages such as Python, with the motivating difference that it operates correctly on universal text.

Now, it seems you rather expect, maybe, the character/pile type to be the important thing and Text to just be a sequence of them? (possibly even unnecessary to be defined formally)

* 430, 456, 474: contains, indexOf, count and probably others should use
generic algorithms, not duplicate them.

* 534: replace is std.array.replace

I had to write algos because most of them in std.algorithm require a range interface, IIUC; and also for testing purposes.
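
To make the "use generic algorithms" suggestion concrete, here is a hedged sketch over a plain array of piles, which already is a range, using present-day std.algorithm. The alias `Pile` and the three function names are hypothetical and do not mirror Text's actual methods.

```d
import std.algorithm.searching : canFind, count, countUntil;

/// Hypothetical stand-in for a pile: the code points of one character.
alias Pile = dchar[];

/// True if some pile in the sequence equals `needle`.
bool contains(Pile[] piles, Pile needle)
{
    return piles.canFind!(p => p == needle);
}

/// Index of the first pile equal to `needle`, or -1 if there is none.
ptrdiff_t indexOf(Pile[] piles, Pile needle)
{
    return piles.countUntil!(p => p == needle);
}

/// Number of piles equal to `needle`.
size_t occurrences(Pile[] piles, Pile needle)
{
    return piles.count!(p => p == needle);
}

unittest
{
    Pile[] piles = ["n"d.dup, "o"d.dup, "e\u0308"d.dup, "l"d.dup, "o"d.dup];
    assert(contains(piles, "e\u0308"d.dup));
    assert(indexOf(piles, "l"d.dup) == 3);
    assert(indexOf(piles, "x"d.dup) == -1);
    assert(occurrences(piles, "o"d.dup) == 2);
}
```

These comparisons are by code-point sequence, so they are only meaningful if both the text and the needle have been normalised the same way beforehand.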

* 623: copy copies the piles shallowly (not sure if that's a problem)


I had the same question.

As I mentioned before - why not focus on defining a Grapheme type (what
you call Pile, but using UTF encoding) and defining a ByGrapheme range
that iterates a UTF-encoded string by grapheme?

Dunno. This simply was not my approach. It seems to me that Text as is provides clients with an interface as simple and clear as possible, while operating correctly in the background.

It seems that if you just build a ByGrapheme iterator, then you have no other choice than abstracting on the fly (constructing piles on the fly for operations like indexing, and normalising them in addition for searching, counting...).

As I said in other posts, this may be the right thing to do from an efficiency point of view, but this remains to be proven. I bet the opposite, in fact: that --with the same implementation language and the same investment in optimisation-- the approach defining a true textual type like Text is inevitably more efficient by orders of magnitude (*). Again, Text's initial construction cost is an investment. Prove me wrong (**).
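
The two approaches being weighed here can be sketched with present-day Phobos, whose std.uni offers `byGrapheme` and `normalize` (both newer than this thread). The function names below are hypothetical, the choice of NFD is an assumption, and the sketch says nothing about which approach is actually faster; it only shows where the segmentation and normalisation work happens.

```d
import std.algorithm.iteration : map;
import std.array : array;
import std.range : drop, walkLength;
import std.uni : byGrapheme, Grapheme, NFD, normalize;

/// Copies one grapheme's code points into a standalone array (a "pile").
dchar[] toPile(Grapheme g)
{
    dchar[] pile;
    foreach (i; 0 .. g.length)
        pile ~= g[i];
    return pile;
}

/// On the fly: every call re-segments the string from its start.
/// Assumes `n` is a valid grapheme index.
dchar[] nthOnTheFly(string s, size_t n)
{
    return toPile(s.byGrapheme.drop(n).front);
}

/// Up front: normalise and segment once; later indexing is a plain
/// array access.
dchar[][] buildPiles(string s)
{
    return s.normalize!NFD.byGrapheme.map!toPile.array;
}

unittest
{
    // "noël" written with a combining diaeresis: 5 code points, 4 graphemes.
    assert("noe\u0308l".byGrapheme.walkLength == 4);
    assert(nthOnTheFly("noe\u0308l", 2) == "e\u0308"d);

    // The precomposed spelling yields the same piles after NFD.
    auto piles = buildPiles("no\u00EBl");
    assert(piles.length == 4);
    assert(piles[2] == "e\u0308"d);
}
```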

Andrei


Denis

(*) Except, probably, for the choice of making the ElementType a singleton Text (seems costly).

(**) I'm now aware of the high speed loss Text certainly suffers from representing characters as mini-arrays, but I guess it is marginally relevant compared to the gain of not piling and normalising for every operation.

_________________
vita es estrany
spir.wikidot.com
