Ignoring the diff for the moment, you will probably win with some coding for the actual counts. My vision would be a row-coded complex matrix. Each row would have a list of id's, a list of counts, and a list of diffs.
If you have your ids coded so that the most commonly rated items come first, then simple integer compression should help on the list of id's. The counts are already heavily skewed to small numbers so a bit-variable representation should win big on that side. The diffs are probably pretty degenerate as well. If you look at the distribution of values you will either have a small number of very common values (that can be encoded with small integers) or a pretty tight distribution (that can be encoded by picking ranges at the level of accuracy you want, encoding the range as an integer and then huffman coding the integers). The canonical reference for compressing integers is Williams and Zobel, http://www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf I can try to work up an example implementation for these codes, but my time this week is kind of tight(er). On Mon, Aug 31, 2009 at 11:01 AM, Sean Owen <sro...@gmail.com> wrote: > . I believe I will proceed with investigating > a structure that maps a long (really two ints together) to a count and to a > total diff. > -- Ted Dunning, CTO DeepDyve