On Mon, Sep 7, 2009 at 2:15 AM, Robin Anil <robin.a...@gmail.com> wrote:
> The FastMap, BitVector and other classes in taste.common are being used (or
> should be used) by other packages. We can start our own collections package,
> say ...mahout.collections?
They could be used, but they need some work: those 'Map' and 'Set' classes only
work with longs, and don't fully implement the Map/Set interfaces (though they
could). If someone has a use case, holler at me and I'll move them out.

The possibly useful classes from taste.common are the Iterator/Iterable-related
items: FileLineIterator, for iterating over a file's lines; SkippingIterator, an
interface for skipping forward; PermutingIterator, for iterating over a
permutation of a collection; and SamplingIterator, for iterating over only some
fraction of a collection.

RandomUtils might be useful in that it provides a standard way to get a PRNG,
and I think there are three or four different PRNG implementations used in the
code at the moment. It also lets one globally select 'test mode', in which all
the PRNGs are seeded with the same seed every time, to make unit tests
reproducible.

> about Cache: the reason why both implementation differ is. At one place the
> datastore/retriever code is extended by the cache class. The other place, the
> datastore is independent of the cache and the application or the algorithm
> uses cache explicitly to get data or fetch+put data into the cache. Plus it
> didnt make sense to make Hbase retriever of the form <K,V> because, Hbase is
> used as getCell(row, columnfamily, column) or getFamily(row, columnfamily) or
> getRow(row)

I wouldn't push on it, but to continue the thread, I think it's possible to do
anything with either implementation. My pet theory about caches is that the
caller should only need one operation: get(K key). The cache is instantiated
with an object that can load a value for a key; given that, it can do all its
work. Otherwise every caller has to know how to manage the cache and how to
load values for keys. Maybe this is partly what you are getting at, in which
case this approach solves some of those issues? (There's a rough sketch at the
end of this message.)

... and since I'm on that topic: it can get expensive to implement eviction
strategies like LRU or LFU, since they need a whole set of parallel data
structures to track access times and so on. In the end this mattered enough for
my purposes that I went with something that makes slightly dumber decisions
about what to evict but is much leaner. I'd suggest that in most cases it is
not important to find the absolute oldest thing to evict, just something kinda
old.

I don't know what this is called, but what I do is hold a bitset with a bit for
each entry. When an entry is accessed, its bit is set to 1. When the cache
fills, a random item is evicted -- but if we choose an item whose bit is 1, we
just set it to 0 and pick something else. So recently used items get a 'second
chance' to escape eviction. Quite lean, and it works fine. (A sketch of this is
at the end too.)

Anyhow, I might suggest this eviction-strategy business is itself a motivation
to unify the caches, at which time it might become clearer that there's little
barrier to doing so.

> about Pair: I really didnt see that. Again, lets move all these helper
> classes out of taste and ensure its getting reused by other algorithms as
> well. And it will also ease adding more trove/colt like collections classes

As a warmup I'll put together a patch that unifies all this.
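
To make the loader-based idea concrete, here is a minimal sketch. The names
(LoadingCache, Retriever) are made up for illustration and are not the current
Taste code; the point is just that the caller only ever sees get(key):

// Illustrative only; names are invented, not the existing Taste classes.

import java.util.HashMap;
import java.util.Map;

/** Knows how to load the value for a key, e.g. from a DataModel or HBase. */
interface Retriever<K, V> {
  V get(K key);
}

/** A cache whose only public operation is get(); misses are loaded internally. */
final class LoadingCache<K, V> {

  private final Map<K, V> map = new HashMap<K, V>();
  private final Retriever<K, V> retriever;

  LoadingCache(Retriever<K, V> retriever) {
    this.retriever = retriever;
  }

  synchronized V get(K key) {
    V value = map.get(key);
    if (value == null) {
      value = retriever.get(key);   // load on a miss
      map.put(key, value);
    }
    return value;
  }
}

The HBase case fits this shape too: the Retriever can wrap getRow()/getCell()
behind whatever key type is natural, so the cache never needs to know where
values come from.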
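
And a rough sketch of the bitset 'second chance' eviction. Again the names are
invented, and I've used a per-key flag map for clarity where the real thing
keeps an actual bitset over entry slots:

// Illustrative only. The 'used' map plays the role of the bitset: one bit per
// entry, set on access, cleared when an entry survives an eviction attempt.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class SecondChanceCache<K, V> {

  private final int maxSize;
  private final Map<K, V> map;
  private final Map<K, Boolean> used;   // the per-entry "recently accessed" bit
  private final List<K> keys;           // to pick a random eviction candidate
  private final Random random = new Random();

  SecondChanceCache(int maxSize) {
    this.maxSize = maxSize;
    this.map = new HashMap<K, V>(maxSize);
    this.used = new HashMap<K, Boolean>(maxSize);
    this.keys = new ArrayList<K>(maxSize);
  }

  synchronized V get(K key) {
    V value = map.get(key);
    if (value != null) {
      used.put(key, Boolean.TRUE);   // mark as recently used
    }
    return value;
  }

  synchronized void put(K key, V value) {
    if (!map.containsKey(key)) {
      if (map.size() >= maxSize) {
        evictOne();
      }
      keys.add(key);
    }
    map.put(key, value);
    used.put(key, Boolean.FALSE);
  }

  // Pick random entries until one is found whose bit is clear; a recently
  // used entry just loses its bit and stays in the cache ("second chance").
  private void evictOne() {
    while (true) {
      K candidate = keys.get(random.nextInt(keys.size()));
      if (Boolean.TRUE.equals(used.get(candidate))) {
        used.put(candidate, Boolean.FALSE);   // spare it this time
      } else {
        keys.remove(candidate);
        map.remove(candidate);
        used.remove(candidate);
        return;
      }
    }
  }
}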