On Monday, 11 June 2012 at 14:48:27 UTC, Dmitry Olshansky wrote:
On 11.06.2012 16:11, Roman D. Boiko wrote:
On Tuesday, 5 June 2012 at 13:27:12 UTC, Roman D. Boiko wrote:
maybe it is better to avoid immutability... or do bulk ins/del before copy.
Would it be difficult to introduce two optimizations:
* have strings of all lengths above a configurable threshold go
to the same bucket (is that reasonable?)
The sample I listed did just that: for length >= 64 it goes to
the bucket of 0-length strings. After all, it's your prefix
function, with any distribution you like ;)
Great
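Just to make sure I understand the idea, here is roughly how I picture such a prefix function. A minimal sketch in D; the threshold and the names are mine, not your actual code:

// Sketch only: everything at or above some threshold shares a bucket
// with the zero-length strings, as in your sample.
enum size_t lengthThreshold = 64;

size_t lengthBucket(const(char)[] s) pure nothrow @nogc
{
    return s.length >= lengthThreshold ? 0 : s.length;
}

unittest
{
    assert(lengthBucket("") == 0);
    assert(lengthBucket("foo") == 3);
    assert(lengthBucket(new char[100]) == 0); // long strings collapse into bucket 0
}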
* have the ability to store a checkpoint, then do a series of
modifications, and store another checkpoint, reusing data which
has not been affected.
yea, that's a rough sketch of what I want to support.
That's wonderful :)
For the second optimization, a possible design would be to
store the internal state as a snapshot + diffs, and apply the
diffs when creating another snapshot. The diff format should
allow trie interface operations to be performed efficiently.
It may be. The diff could be a bunch of pages that are XOR-ed on
top of the snapshot. Dunno if it's a worthwhile trick yet.
It should be, provided that a single insert/delete doesn't
affect many pages and the page size is reasonably small. Only
the affected pages need to be created, and there is no need to
track separate diffs between snapshots, so they can be combined.
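To sketch what I mean by reusing untouched pages (my own toy illustration, not the actual implementation; XOR-ing the changed pages, as you suggested, would have the same structure):

enum size_t pageSize = 256; // "reasonably small" page, value picked arbitrarily

struct Snapshot
{
    immutable(int)[][] pages; // pages are shared between snapshots by reference
}

struct Diff
{
    immutable(int)[][size_t] changedPages; // only the pages that were touched
}

// Building the next snapshot: shallow-copy the page table and swap in
// the modified pages; every untouched page is reused as-is.
Snapshot apply(Snapshot base, Diff diff)
{
    auto pages = base.pages.dup;
    foreach (idx, page; diff.changedPages)
        pages[idx] = page;
    return Snapshot(pages);
}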
Another option might be creating a separate trie for insertions
and tracking deletions somehow, provided that tries can be merged
efficiently.
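Something along these lines, using plain sets instead of real tries just to show the shape of the idea (all names are made up):

// Hypothetical overlay: new strings go into `inserted`, removed ones are
// only marked in `deleted`; a lookup consults the overlay before the base.
struct Overlay
{
    bool[string] inserted;
    bool[string] deleted;

    bool contains(const bool[string] base, string s) const
    {
        if (s in deleted) return false;
        return (s in inserted) !is null || (s in base) !is null;
    }

    // Merging folds the overlay back into the base set in one bulk pass.
    void mergeInto(ref bool[string] base)
    {
        foreach (s, _; inserted) base[s] = true;
        foreach (s, _; deleted) base.remove(s);
        inserted = null;
        deleted = null;
    }
}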
Actually, in my case, deletions could be deferred and performed
in bulk. OTOH, I will need to count how many times a string has
been inserted minus the number of times it has been deleted.
Alternatively, I could just check from time to time which strings
are not needed any more (micro-GC :) ).
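In code, the counting plus occasional sweep could look roughly like this (again just my own sketch):

// Per-string counter: insertions minus deletions. Strings whose counter
// reaches zero are only reclaimed during an explicit sweep ("micro-GC").
struct StringCounts
{
    int[string] counts;

    void add(string s)
    {
        if (auto p = s in counts) ++*p;
        else counts[s] = 1;
    }

    void release(string s)
    {
        if (auto p = s in counts) --*p;
    }

    void sweep()
    {
        string[] dead;
        foreach (s, n; counts)
            if (n <= 0) dead ~= s;        // collect first: don't mutate while iterating
        foreach (s; dead)
            counts.remove(s);
    }
}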
There are many possible data structures, but the one you
mentioned seems to be the most reasonable.