MemoryDiffStorage is the only class in Mahout that uses CompactRunningAverage. If all of the compact averages are in two arrays, the benefits of hardware caching go way up. This seems a more compelling reason to recode MemoryDiffStorage :)
Lance On Fri, Mar 25, 2011 at 9:22 PM, Lance Norskog <goks...@gmail.com> wrote: >> I think that's the bit you're missing. > 8 bits I'm missing, actually. BA-ZING! > > 64-bit implementations of Java align on 8-byte boundaries. A 2-byte > value plus a 4-byte value will be packed into 8 bytes. An 8-byte > double and a 4-byte int will add 4 useless bytes with alignment, > giving 16 bytes.. > > Including 16 bytes per object, Compact v.s. Full is 24 bytes v.s. 32 > bytes. Since there are 2 unused bytes in the Compact object, the > 16-bit 'char' counter might as well be a 31-positive-bit counter. > Similarly, the integer counter in FullRunningAverage could be a long > instead of an int. > > For comparison, an empty String is 40 bytes, or 5 8-byte words. > > On Fri, Mar 25, 2011 at 2:33 AM, Sean Owen <sro...@gmail.com> wrote: >> A char is a 2-byte unsigned value in Java, not 1-byte signed -- I >> think that's the bit you're missing. >> >> Ignoring object overhead, there are 2 (char) + 4 (float) = 6 bytes of >> payload in CompactRunningAverage. There are 4 (int) + 8 (double) = 12 >> bytes of payload in FullRunningAverage. >> >> I don't know what you mean that 'count' should be an int. >> >> I do agree that it would be more memory-efficient to make a sort of >> array of these things, in cases where you have a dense sequence of >> them. However the use case you cite isn't one of those -- those diffs >> are sparse. You can add into the object the row/col position but then >> that just eats up the space savings. >> >> On Fri, Mar 25, 2011 at 9:19 AM, Lance Norskog <goks...@gmail.com> wrote: >>> The CompactRunningAverage class is >>> 1) not compact, and >>> 2) not coded right. >>> >>> The code using 'count' seems to think it is a short. The code assumes >>> that shorts top out at 65535, >>> but shorts are signed and thus top out at 32767. >>> 'count' is declared as a 'char' so it will never reach 128, let alone 65536. >>> The 'count' field uses 4 bytes so it should be an int. >>> Any object has 16 bytes. The CompactRunningAverage has (at least) 24 >>> bytes. FullRunningAverage has (at least) 32 bytes. >>> >>> Something that supplies compact running averages should directly >>> allocate and manage an array of shorts and an array of floats. >>> MemoryDiffStorage is the only use of CompactRunningAverage in the code >>> base. It can use "hundreds of thousands" of these counters. >>> It also needs 8 bytes per counter for each object reference, v.s. 4 >>> bytes per array index. >>> >>> -- >>> Lance Norskog >>> goks...@gmail.com >>> >> > > > > -- > Lance Norskog > goks...@gmail.com > -- Lance Norskog goks...@gmail.com