On 07/18/2010 04:30 PM, Philippe Sigaud wrote:
On Sun, Jul 18, 2010 at 00:55, Andrei Alexandrescu  wrote:
You could consider than find() is a multiple-returns function: it may
return 0, 1 or many values (the successive finds). Which can, obviously,
be easily represented by a range. I see find() as bit like filter(): its
return type is a range determined by the predicate and whose elements
are the rest of the input range.
so:
find([ 1, 4, 2, 3, 4, 0 ], 4) returns a range whose elements are
[4,2,3,4,0] and [4,0].

If no needle is present in haystack, then the result range is empty.
It's just it's an empty range of ranges, not an empty version of the
original range.

A complement information can be how many steps were taken (that is, the
index of the found elements):  ([4,2,3,4,0], 1), ([4,0], 4). A policy or
another function could take care of that: findIndexed(haystack, needle)
or find!(withIndex)(haystack, needle).

I agree that a findAll function yielding a range is nice, however it's not the simplest interface. I think there should be a function that just finds something somewhere.

Now, for many needles... Currently, you pass the predicate "a==b" to
find() and use it on all needles. A possible generalization is to have a
isOneOf() predicate, called with isOneOf(a, b0, b1, ... bn) where a is
the current element being tested and the b's are the needles passed to
find(). This generic calling of pred with (a, needles) could be done for
any predicate, like for example:

find!(isBetween)(haystack, 0, 10)

Aren't such scenarios better served by putting the limits inside the predicate? Otherwise the list of what can be done could go on forever.

The limit of this is you cannot easily get the index of the found
needle. btw, could you remind us why you used a 1-based index? My
opinion on this is that if you want to find any needle, then you mostly
care for them being present, and much less for wich one is the first.

I used one-based index to make tests easier - as you said, most often you care for the presence so zero/nonzero is easiest to tell. Then I thought it wouldn't hurt to provide the index since it's not any extra work.

Also, the found element is accessible by taking the front of the retuned
range. Why provide it to the caller?

Because that doesn't generalize well to searching ranges (as oposed to individual elements).

Another problem I see is to get the nice current behavior of having a
needle being a range like haystack:

find([0,1,2,3,4,3,2,1], [4,3]) => [4,3,2,1]

And that's I particularly like.

Not sure I understand whether that's good or bad, but I also want to provide the likes of findSkip() (it's been discussed here under a different name, findConsume) that positions the searched range after the found range:

findSkip([0,1,2,3,4,3,2,1], [4,3]) => [2,1]

It's very useful in practice.

*****
     Another aspect I'd like to discuss is use of Boyer-Moore and
related fast finding techniques. Currently the use of B-M is explicit
and a bit awkward. I was thinking instead to automatically detect its
appropriatedness and use it transparently. Bearophile posted at a link
to such a blend string search routine
(http://effbot.org/zone/stringlib.htm) that I think can be generalized
quite easily.
*****

Indeed, reading this, it seems like it could be generalized easily. You
need a random-access range with a defined length. So the usage of such
an algo could indeed be done transparently. Hmm...

The current B-M implementation already works with any random-access range with length, but it's not integrated transparently with the regular find. I wonder what the threshold conditions might look like.

Andrei

Reply via email to