Re: Request for brainstorming: how to represent item-item diffs efficiently

David Hall Sun, 30 Aug 2009 10:12:48 -0700

On Sun, Aug 30, 2009 at 3:39 AM, Sean Owen<sro...@gmail.com> wrote:
> So the task is to make a data structure that maps a pair of item IDs
> (longs) to a count (int, likely) and a total diff (float, likely). The
> key operation is not actually to retrieve the diff for a single pair,
> but rather, to retrieve all diffs involving some item A (that is, all
> pairs (A,x) or (x,A)). That's what we're optimizing for. It is
> acceptable to assume that no new item IDs will be added at runtime.
> It's also potentially acceptable to use some hashing scheme that
> *might* once in a blue moon return the wrong diff.


Brainstorm:
If the diffs are symmetric, suppose you store both (x,A) and (A,x),
and the set of diffs for the first member of the pair is stored as a
sorted list where the second member is stored as a delta-encoded
gamma-encoded int from the previous pair, the count is also
gamma-encoded (I'm assuming these are usually small?), and then the
float is stored on the next 4-byte boundary?

If the number of diffs is reasonably dense (or at least they're
reasonably contiguous) and the counts are usually small (<8), you're
likely to only pay three or four bytes per entry on top of the float.

May or may not provide any savings, I dunno.

-- David

Re: Request for brainstorming: how to represent item-item diffs efficiently

Reply via email to