Re: Request for brainstorming: how to represent item-item diffs efficiently

Ted Dunning Sun, 30 Aug 2009 12:56:57 -0700

David's approach is close to the right thing.  This is very much like a
sparse row matrix with special handling for the rows.  All you really need
is a good class for the rows.  Once you have that, then the table holding
the rows isn't such a big deal.

Since your indexes are essentially random Long values, differential coding
is unlikely to help very much until you have massive numbers of entries in a
row.  Even with tens of millions of entries per row, you will only be saving
less than 3 bytes out of 8.

On the other hand, you could build a side table with all possible A's and
x's in it.  This could be as simple as a sorted list of all values or could
be a closed hash set (dangerous if you increase the number of entries!).
This would let you use a normal row sparse matrix.  This has a cost of k * 8
* number of unique items, but it allows a savings of 4 * number of diffs in
your table.  If the average number of entries per item is > 2 * k, you win.

On Sun, Aug 30, 2009 at 10:12 AM, David Hall <d...@cs.berkeley.edu> wrote:

> On Sun, Aug 30, 2009 at 3:39 AM, Sean Owen<sro...@gmail.com> wrote:
> > So the task is to make a data structure that maps a pair of item IDs
> > (longs) to a count (int, likely) and a total diff (float, likely). The
> > key operation is not actually to retrieve the diff for a single pair,
> > but rather, to retrieve all diffs involving some item A (that is, all
> > pairs (A,x) or (x,A)). That's what we're optimizing for. It is
> > acceptable to assume that no new item IDs will be added at runtime.
> > It's also potentially acceptable to use some hashing scheme that
> > *might* once in a blue moon return the wrong diff.
>
> Brainstorm:
> If the diffs are symmetric, suppose you store both (x,A) and (A,x),
> and the set of diffs for the first member of the pair is stored as a
> sorted list where the second member is stored as a delta-encoded
> gamma-encoded int from the previous pair, the count is also
> gamma-encoded (I'm assuming these are usually small?), and then the
> float is stored on the next 4-byte boundary?
>
> If the number of diffs is reasonably dense (or at least they're
> reasonably contiguous) and the counts are usually small (<8), you're
> likely to only pay three or four bytes per entry on top of the float.
>
> May or may not provide any savings, I dunno.
>
> -- David
>

-- 
Ted Dunning, CTO
DeepDyve

Re: Request for brainstorming: how to represent item-item diffs efficiently

Reply via email to