Hi,
I am using Apache Mahout's
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
with the following input options:

1. --booleanData
2. --similarityClassname SIMILARITY_LOGLIKELIHOOD

The log-likelihood similarity algorithm expects numeric input. However, my
data is textual. One thing I did was write a trivial standalone Java
program that converts each unique text value to a unique long value. It
does the following:

1. Maintain a Map<String, Long> whose key is the unique text value and
whose value is the unique long value.
2. Before inserting a key, look it up in the Map. If the key exists, do not
create a new long value. If the key does not exist, increment the counter
and insert the key with the new counter value.
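
For reference, the two steps above amount to roughly the following sketch.
The class name TextIdMapper, the method name idFor, and the sample rows are
made up for illustration; this is not the exact program I ran.

```java
import java.util.HashMap;
import java.util.Map;

/** Assigns each distinct text value a stable long ID, held entirely in memory. */
final class TextIdMapper {
    private final Map<String, Long> ids = new HashMap<>();
    private long counter = 0L;

    /** Returns the existing ID for the text, or assigns the next counter value. */
    long idFor(String text) {
        Long existing = ids.get(text);
        if (existing != null) {
            return existing;          // key already known: reuse its ID
        }
        counter++;                    // new key: take the next counter value
        ids.put(text, counter);
        return counter;
    }

    /** Number of distinct text values seen so far (this is what grows). */
    int size() {
        return ids.size();
    }

    public static void main(String[] args) {
        TextIdMapper mapper = new TextIdMapper();
        // Hypothetical sample rows of the form "user,item".
        String[] rows = {"alice,book", "bob,book", "alice,pen"};
        for (String row : rows) {
            String[] parts = row.split(",");
            System.out.println(mapper.idFor(parts[0]) + "," + mapper.idFor(parts[1]));
        }
    }
}
```

The map holds one entry per distinct text value, so its memory use is
proportional to the number of unique values.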

However, for large data sets this is a limitation, since the map grows
with the number of unique text values and has to fit in memory.

There are a couple of alternatives, but each has a drawback:

1. Create a database table with a uniqueness constraint (a primary key) on
the text value. Query the table before inserting a new long value. I am
guessing this may be slow.
2. Use a hashing algorithm. Whatever algorithm I choose, there is a
possibility of collision, so there is no guarantee of a unique long value
for a given unique text value.
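
To make option 2 concrete, here is a minimal sketch of what I have in mind.
The class name TextHasher and the choice of MD5 (truncated to its first 8
bytes) are only assumptions for illustration; any hash would have the same
issue, since two different strings can still map to the same long.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Derives a long from a text value without keeping any lookup table. */
final class TextHasher {
    /** Returns the first 8 bytes of the MD5 digest of the UTF-8 text as a long. */
    static long hashToLong(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(text.getBytes(StandardCharsets.UTF_8));
            // The digest is 16 bytes; getLong() reads the first 8, big-endian.
            // The result can be negative, and collisions remain possible.
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hashToLong("alice"));
        System.out.println(hashToLong("book"));
    }
}
```

This needs no shared state, so it would work inside a mapper, but it gives
up the uniqueness guarantee that the Map approach has.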

Are there any better ways to solve this for a large data set?

Thanks,
Ramu
