Ramu, sorry for the belated response, but if you're still interested you may want to try the new version of item similarity, which is described in this article: https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
Best,
Andrew

On Thu, Sep 20, 2018 at 5:10 AM Ramu Ramaiah <[email protected]> wrote:

> Hi,
> I am using the Apache Mahout's
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
> with the input options
>
> 1. --booleanData
> 2. --similarityClassname SIMILARITY_LOGLIKELIHOOD
>
> The loglikelihood similarity algorithm expects a numeric input. However, I
> have textual data. One of the things I did was to write a trivial
> standalone Java program to convert each unique text value to a unique long
> value, which does the following.
>
> 1. Maintain a Map such that the key is the unique text value and the value
> is the unique long value. Map<String, Long>.
> 2. Before we insert the key, we look up the Map. If the key exists, do not
> create a new Long value. If the key does not exist, increment the counter
> value and insert it into the Map.
>
> However, for large data sets, this may have a limitation since the map size
> grows with the number of unique text values.
>
> There are a couple of ways to do this
>
> 1. Create a database table with a uniqueness constraint on the text value
> (a primary key). Query the table before inserting a new long value. I am
> guessing this may be slow.
> 2. Whatever hashing algorithm I may choose, there's a possibility of
> collision and there's no guarantee of a unique long value for a given
> unique text value.
>
> Are there any better ways to solve this for a large data set?
>
> Thanks,
> Ramu
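
For reference, the in-memory dictionary approach described in the quoted message can be sketched roughly as follows. This is only a minimal illustration, not anything shipped with Mahout: the class name IdDictionary, its methods, and the sample rows are all made up, and it assumes the job's input is comma-separated userID,itemID lines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a Mahout class): gives each distinct string a stable long ID. */
public class IdDictionary {
    // Insertion-ordered, so the mapping can later be dumped in ID order
    // and used to translate numeric IDs in the job output back to text.
    private final Map<String, Long> ids = new LinkedHashMap<>();

    /** Returns the ID already assigned to key, or assigns the next counter value. */
    public long idFor(String key) {
        Long existing = ids.get(key);
        if (existing != null) {
            return existing;
        }
        long next = ids.size(); // counter: 0, 1, 2, ... in order of first appearance
        ids.put(key, next);
        return next;
    }

    /** Number of distinct keys seen so far. */
    public int size() {
        return ids.size();
    }

    public static void main(String[] args) {
        // Separate dictionaries for users and items (an assumption of this sketch).
        IdDictionary users = new IdDictionary();
        IdDictionary items = new IdDictionary();

        // Made-up sample data: (user text, item text) pairs.
        String[][] rows = { {"alice", "book"}, {"bob", "book"}, {"alice", "pen"} };
        for (String[] row : rows) {
            // Emit one numeric "userID,itemID" line per input row.
            System.out.println(users.idFor(row[0]) + "," + items.idFor(row[1]));
        }
    }
}
```

Memory use here grows with the number of distinct text values rather than the number of input rows, which is the limitation raised in the quoted message. Unlike a hash of the text, a counter-based dictionary cannot produce collisions.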
