On Tue, 5 Oct 2004, Khamenia, Valery wrote:
BTW, do you mean that current hash-based implementation brings *clearly*
better performance than any O(n*log(n)) sort based algorithm?
If I have correctly understood src/main/unique.c then current
hash function is niether minimal perfect hash function nor even
minimal hash function. In addition, as might be expected,
current hash function uses full pass through the string to get
a hash key.

So, in particular, can anyone clearly show that the current
hash-based algorithm will be quicker then sort-based
algorithm if the input has:

1. a lot of strings;
2. strings are very long;
3. strings are quite unsimilar

?

Hm, I don't believe you are ready to prove smth. like that.

Clearly the algorithm is not optimal for all possible sets of inputs (eg if the inputs are already known to be unique then an even faster implementation is just to do nothing).


As R's string comparison function use C's strcmp, for which the C standard makes no performance guarantees whatsoever, it is not possible to prove *anything* about the relative performance of algorithms without making some additional assumptions.

However, if this was a serious question, it is known that
a) in some versions of R on some byte-orderings the hash was broken and the performance was far inferior, suggesting that it is ordinarily quite effective.
b) Switching the rowsum function from a sort-based implementation to hashing produced a substantial speed increase.


        -thomas

______________________________________________
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Reply via email to