For the insertion stage, the Dict is likely ideal.

For the second stage, a Dict that is based on hashing is a good data
structure. Another data structure worth examining would be a sorted vector
of `(Int64, Float64)` pairs (sorted by the `Int64` keys). Interpolation
search <https://en.wikipedia.org/wiki/Interpolation_search> can then have a
complexity of `O(log log N)`. My personal guess is that a well-optimized
Dict (i.e. with both hash function and hash table size "optimized") will be
faster if we assume a close to random access, as it will have fewer memory
accesses.

Julia's Dict implementation (see base/dict.jl) has constants that you could
tune (read: play with). There is also a function `Base.rehash!` that you
can call to increase the size of the hash table, which might increase
performance by avoiding hash collisions, if you have sufficient memory.

-erik


On Wed, Aug 17, 2016 at 7:58 PM, 'Greg Plowman' via julia-users <
julia-users@googlegroups.com> wrote:

> I need fast random access to a Dict{Int64,Float64}.
> My application has a first phase in which the Dict is populated, and a
> second phase where it accessed randomly (with no further additions or
> deletions).
> There are about 50,000 to 100,000 entries, with keys in the range 10^9 to
> 10^14.
>
> Firstly is a Dict the most optimal data structure? (presumably Dict is
> faster than SparseVector for random lookup?)
>
> Secondly, is there any "preconditioning" I can do after creating the Dict
> in phase 1 but before phase 2, to optimise the Dict for random access.
> (e.g. sizehint, sorting keys)
>
> I do appreciate access might already be fast and optimal and I should just
> benchmark, but I'm just looking for some theoretical
> insight before benchmarking.
>
>
>


-- 
Erik Schnetter <schnet...@gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/

Reply via email to