Re: Request for brainstorming: how to represent item-item diffs efficiently

Sean Owen Sun, 30 Aug 2009 15:42:21 -0700

On Sun, Aug 30, 2009 at 8:56 PM, Ted Dunning<ted.dunn...@gmail.com> wrote:
> David's approach is close to the right thing.  This is very much like a
> sparse row matrix with special handling for the rows.  All you really need
> is a good class for the rows.  Once you have that, then the table holding
> the rows isn't such a big deal.


Yeah that is really a central question! how to store the rows.

I think a good answer must not involve an object per item-item diff,
since the memory overhead is just terrible. So, perhaps one data
structure for counts and a parallel data structure for total diff.

But what data structure?

> On the other hand, you could build a side table with all possible A's and
> x's in it.  This could be as simple as a sorted list of all values or could
> be a closed hash set (dangerous if you increase the number of entries!).
> This would let you use a normal row sparse matrix.  This has a cost of k * 8
> * number of unique items, but it allows a savings of 4 * number of diffs in
> your table.  If the average number of entries per item is > 2 * k, you win.

Yeah I think you are suggesting a lookup table, which would
effectively translate a long ID into an int offset. Then I can put two
such int "IDs" together into a new long ID for a pair of items. So now
I just have to map longs to something.

And I could have two parallel mappings from these long-pair-IDs to
counts, and to total diffs. That is roughly also the best idea I have
so far.

Re: Request for brainstorming: how to represent item-item diffs efficiently

Reply via email to