Re: Request for brainstorming: how to represent item-item diffs efficiently

Ted Dunning Mon, 31 Aug 2009 11:30:58 -0700

Ignoring the diff for the moment, you will probably win with some coding for
the actual counts.  My vision would be a row-coded complex matrix.  Each row
would have a list of id's, a list of counts, and a list of diffs.

If you have your ids coded so that the most commonly rated items come first,
then simple integer compression should help on the list of id's.  The counts
are already heavily skewed to small numbers so a bit-variable representation
should win big on that side.  The diffs are probably pretty degenerate as
well.  If you look at the distribution of values you will either have a
small number of very common values (that can be encoded with small integers)
or a pretty tight distribution (that can be encoded by picking ranges at the
level of accuracy you want, encoding the range as an integer and then
huffman coding the integers).

The canonical reference for compressing integers is Williams and Zobel,
http://www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf

I can try to work up an example implementation for these codes, but my time
this week is kind of tight(er).

On Mon, Aug 31, 2009 at 11:01 AM, Sean Owen <sro...@gmail.com> wrote:

> . I believe I will proceed with investigating
> a structure that maps a long (really two ints together) to a count and to a
> total diff.
>

-- 
Ted Dunning, CTO
DeepDyve

Re: Request for brainstorming: how to represent item-item diffs efficiently

Reply via email to