Hi,

I am using Apache Mahout's org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob with the following input options:

1. --booleanData
2. --similarityClassname SIMILARITY_LOGLIKELIHOOD

The log-likelihood similarity algorithm expects numeric input. However, my data is textual. As a workaround, I wrote a trivial standalone Java program that converts each unique text value to a unique long value. It works as follows:

1. Maintain a Map<String, Long> in which the key is the unique text value and the value is the unique long value.
2. Before inserting a key, look it up in the Map. If the key exists, do not create a new long value. If the key does not exist, increment a counter and insert the key with the new counter value into the Map.

However, this has a limitation for large data sets, since the Map grows with the number of unique text values.

I have considered a couple of alternatives:

1. Create a database table with a uniqueness constraint (a primary key) on the text value, and query the table before inserting a new long value. I am guessing this may be slow.
2. Hash the text value to a long. Whatever hashing algorithm I choose, though, there is a possibility of collisions, so there is no guarantee of a unique long value for a given unique text value.

Are there any better ways to solve this for a large data set?

Thanks,
Ramu
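
P.S. For reference, the in-memory mapping described above can be sketched roughly as follows. This is only a minimal illustration of the lookup-then-increment idea, not the actual program; the class name TextIdMapper and the methods idFor and size are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: assigns each distinct text value a unique long ID.
final class TextIdMapper {
    // Key: unique text value. Value: the long ID assigned on first sight.
    private final Map<String, Long> ids = new HashMap<>();
    private long counter = 0L;

    // Returns the existing ID for the text, or assigns the next counter value.
    long idFor(String text) {
        Long existing = ids.get(text);
        if (existing != null) {
            return existing;          // key exists: reuse its ID
        }
        counter++;                    // key is new: take the next ID
        ids.put(text, counter);
        return counter;
    }

    // Number of distinct text values seen so far (this is what grows in memory).
    int size() {
        return ids.size();
    }

    public static void main(String[] args) {
        TextIdMapper mapper = new TextIdMapper();
        String[] items = {"apple", "banana", "apple", "cherry"};
        for (String item : items) {
            System.out.println(item + " -> " + mapper.idFor(item));
        }
    }
}
```

The repeated "apple" gets the same ID both times, so the IDs are unique per text value, but the whole Map has to fit in memory, which is the limitation mentioned above.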