On Mon, Sep 7, 2009 at 2:15 AM, Robin Anil <robin.a...@gmail.com> wrote:
> The FastMap, BitVector and other classes in taste.common are being used (or
> should be used) by other packages. We can start our own collections package,
> say ...mahout.collections?
They could be used, but they need some work: those 'Map' and 'Set' classes only
work with longs, and don't fully implement the Map/Set interfaces (though they
could). If someone has a use case, holler at me and I'll move them out.

The possibly useful classes from taste.common are the Iterator/Iterable-related
items: FileLineIterator, for iterating over a file's lines; SkippingIterator, an
interface for skipping forward; PermutingIterator, for iterating over a
permutation of a collection; and SamplingIterator, for iterating over only some
fraction of a collection.

RandomUtils might be useful in that it provides a standard way to get a PRNG,
and I think there are three or four different PRNG implementations used in the
code at the moment. It also lets one globally select 'test mode', in which all
the PRNGs are seeded with the same seed every time, to make unit tests
reproducible.

> about Cache: the reason why both implementation differ is. At one place the
> datastore/retriever code is extended by the cache class. The other place, the
> datastore is independent of the cache and the application or the algorithm
> uses cache explicitly to get data or fetch+put data into the cache. Plus it
> didnt make sense to make Hbase retriever of the form <K,V> because, Hbase is
> used as getCell(row, columnfamily, column) or getFamily(row, columnfamily) or
> getRow(row)

I wouldn't push on it, but to continue the thread, I think it's possible to do
anything with either implementation. My pet theory about caches is that the
caller should only need one operation: get(K key). The cache is instantiated
with an object that can load a value for a key; given that, it can do all its
work. Otherwise every caller has to know how to manage the cache and how to
load values for keys. Maybe this is partly what you are getting at, in which
case this approach solves some of those issues? (There's a rough sketch at the
end of this message.)

... and since I'm on that topic: it can get expensive to implement eviction
strategies like LRU or LFU, since they need a whole set of parallel data
structures to track access times and so on. In the end this mattered enough for
my purposes that I went with something that makes slightly dumber decisions
about what to evict but is much leaner. I'd suggest that in most cases it is
not important to find the absolute oldest thing to evict, just something kinda
old.

I don't know what this is called, but what I do is hold a bitset with a bit for
each entry. When an entry is accessed, its bit is set to 1. When the cache
fills, a random item is evicted -- but if we choose an item whose bit is 1, we
just set it to 0 and pick something else. So recently used items get a 'second
chance' to escape eviction. Quite lean, and it works fine. (A sketch of this is
at the end too.)

Anyhow, I might suggest this eviction-strategy business is itself a motivation
to unify the caches, at which time it might become clearer that there's little
barrier to doing so.

> about Pair: I really didnt see that. Again, lets move all these helper
> classes out of taste and ensure its getting reused by other algorithms as
> well. And it will also ease adding more trove/colt like collections classes

As a warmup I'll put together a patch that unifies all this.
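
To make the loader-based idea concrete, here is a minimal sketch. The names
(LoadingCache, Retriever) are made up for illustration and are not the current
Taste code; the point is just that the caller only ever sees get(key):

// Illustrative only; names are invented, not the existing Taste classes.

import java.util.HashMap;
import java.util.Map;

/** Knows how to load the value for a key, e.g. from a DataModel or HBase. */
interface Retriever<K, V> {
  V get(K key);
}

/** A cache whose only public operation is get(); misses are loaded internally. */
final class LoadingCache<K, V> {

  private final Map<K, V> map = new HashMap<K, V>();
  private final Retriever<K, V> retriever;

  LoadingCache(Retriever<K, V> retriever) {
    this.retriever = retriever;
  }

  synchronized V get(K key) {
    V value = map.get(key);
    if (value == null) {
      value = retriever.get(key);   // load on a miss
      map.put(key, value);
    }
    return value;
  }
}

The HBase case fits this shape too: the Retriever can wrap getRow()/getCell()
behind whatever key type is natural, so the cache never needs to know where
values come from.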
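
And a rough sketch of the bitset 'second chance' eviction. Again the names are
invented, and I've used a per-key flag map for clarity where the real thing
keeps an actual bitset over entry slots:

// Illustrative only. The 'used' map plays the role of the bitset: one bit per
// entry, set on access, cleared when an entry survives an eviction attempt.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class SecondChanceCache<K, V> {

  private final int maxSize;
  private final Map<K, V> map;
  private final Map<K, Boolean> used;   // the per-entry "recently accessed" bit
  private final List<K> keys;           // to pick a random eviction candidate
  private final Random random = new Random();

  SecondChanceCache(int maxSize) {
    this.maxSize = maxSize;
    this.map = new HashMap<K, V>(maxSize);
    this.used = new HashMap<K, Boolean>(maxSize);
    this.keys = new ArrayList<K>(maxSize);
  }

  synchronized V get(K key) {
    V value = map.get(key);
    if (value != null) {
      used.put(key, Boolean.TRUE);   // mark as recently used
    }
    return value;
  }

  synchronized void put(K key, V value) {
    if (!map.containsKey(key)) {
      if (map.size() >= maxSize) {
        evictOne();
      }
      keys.add(key);
    }
    map.put(key, value);
    used.put(key, Boolean.FALSE);
  }

  // Pick random entries until one is found whose bit is clear; a recently
  // used entry just loses its bit and stays in the cache ("second chance").
  private void evictOne() {
    while (true) {
      K candidate = keys.get(random.nextInt(keys.size()));
      if (Boolean.TRUE.equals(used.get(candidate))) {
        used.put(candidate, Boolean.FALSE);   // spare it this time
      } else {
        keys.remove(candidate);
        map.remove(candidate);
        used.remove(candidate);
        return;
      }
    }
  }
}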