On 5/31/16 1:54 PM, qznc wrote:
On Tuesday, 31 May 2016 at 01:55:16 UTC, Andrei Alexandrescu wrote:
I agree it's difficult to characterize the behavior of substring
search with one number. There are many dimensions of variation. (But
there's no reason for an emotional response.) A few possible baselines
come to mind:

* Search a long string for a one-character string, match and fail.

There is a special version of find for searching a single char in a
string. Using a one-letter needle string is more like a user mistake
than something to optimize for.

Oh, I see what you mean - it's a string consisting only of one character, not one character 'x'. I agree, but I'm just saying it's a sound baseline.

* Take an English text string. Search for a substring consisting of
its last portion (e.g. 5% of the length).

How long should the english text be? A Tweet? A book? A Gigabyte of log
files?

I'm thinking of a few exponentially increasing sizes. e.g. 100, 1000, 10_000, 100_000, 1_000_000.

English text means basically ASCII and no Unicode?

My understanding (and correct me if I'm wrong) is that we're discussing search with the default predicate, ignoring equivalent grapheme representations etc. So there's no need to mind encoding at all, which means the text could be in any language. Foreign languages would further increase the vocabulary, but that's the only effect.

* Take an English text string. Search for a substring consisting of a
fraction of the text (e.g. 3%) with additional characters prepended.
Repeat for appended.

Why the prepend/append? To force a mismatch?

Yah, and it's a good test for strings that have some equal portion yet they are different, which seems to me puts a greater strain on the algorithms. It is arguably a likely situation in the real world.


Thanks,

Andrei

Reply via email to