On Sat, 9 Apr 2022, Raul Miller wrote:

People will say that certain algorithms, such as hashing, are highly efficient. But these assertions are quite often not accompanied by adequate benchmarking on large datasets. And, these approaches often have inefficient worst case behavior.

Hashing is O(1) (or, if you prefer, O(#y) for ~.y, same as sorting). A sufficiently smart(tm) hash function will avoid inordinate collision rates, so I am not sure what worst case behaviour you are referring to.

If you are able to share your 1e8-sized dataset where sorting before removing duplicates was faster than using ~., that would be great. Maybe I can make it faster :)

For me, ~. on a vector of 1e8 integers is always faster than /:~ alone, regardless of the duplication rate, but the specifics depend on the shape of your data (matrix? boxed? float? note /: is intolerant but i. et al are not by default).

(I will note that the algorithm I am thinking of would only be useful for large datasets--it is useless for small ones--and it has a distinct advantage over sorting in that domain.)

All that aside, though, I think the original question has value even if it turns out that an in-order ~. is always as fast as an out-of-order one. _In general_, putting items in a certain order creates information, and it may be possible to profit by avoiding the need to create that information.

It occurs to me that there is a cop-out method: use !., as in ~.!.1.  This
has precedent in i.!.1, but I do not like it. And I think the method can also be used for -., resulting in ambiguities: should x -.!.1 y mean the order of the result does not matter, or that the arguments are sorted? (I know there is currently no -.!.1, but it could plausibly exist.)
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to