Re: How to convert a unique text value to a unique long value for a large data set

2019-04-18 Thread Andrew Musselman
Ramu, sorry for the belated response, but if you're still interested you may
want to try the new version of item similarity, which is described in
this article:
https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Best
Andrew

On Thu, Sep 20, 2018 at 5:10 AM Ramu Ramaiah wrote:

> Hi,
> I am using Apache Mahout's
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
> with the input options
>
> 1. --booleanData
> 2. --similarityClassname SIMILARITY_LOGLIKELIHOOD
>
> The log-likelihood similarity algorithm expects numeric input. However, my
> data is textual. One thing I did was to write a trivial standalone Java
> program to convert each unique text value to a unique long value. It does
> the following:
>
> 1. Maintain a Map whose key is the unique text value and whose value is
> the unique long value, i.e. a Map<String, Long>.
> 2. Before inserting a key, look it up in the Map. If the key exists, do not
> create a new long value. If the key does not exist, increment the counter
> value and insert the key with that value into the Map.
>
> However, for large data sets this approach is limited, since the map size
> grows with the number of unique text values.
>
> I can think of a couple of alternatives, each with a drawback:
>
> 1. Create a database table with a uniqueness constraint on the text value
> (a primary key). Query the table before inserting a new long value. I am
> guessing this may be slow.
> 2. Use a hashing algorithm. Whichever algorithm I choose, there is a
> possibility of collision, so there is no guarantee of a unique long value
> for a given unique text value.
>
> Are there any better ways to solve this for a large data set?
>
> Thanks,
> Ramu
>


Fwd: How to convert a unique text value to a unique long value for a large data set

2018-09-20 Thread Ramu Ramaiah
Hi,
I am using Apache Mahout's
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
with the input options

1. --booleanData
2. --similarityClassname SIMILARITY_LOGLIKELIHOOD

The log-likelihood similarity algorithm expects numeric input. However, my
data is textual. One thing I did was to write a trivial standalone Java
program to convert each unique text value to a unique long value. It does
the following:

1. Maintain a Map whose key is the unique text value and whose value is
the unique long value, i.e. a Map<String, Long>.
2. Before inserting a key, look it up in the Map. If the key exists, do not
create a new long value. If the key does not exist, increment the counter
value and insert the key with that value into the Map.
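
The two steps above can be sketched roughly as follows. This is only a
minimal illustration, not the original program: the class name
TextIdDictionary and the method names idFor and size are hypothetical, and
starting the counter at zero (so the first ID is 1) is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: assigns each distinct text value its own long ID. */
final class TextIdDictionary {
    // Key: unique text value. Value: the long ID assigned to it.
    private final Map<String, Long> ids = new HashMap<>();
    // Assumed starting point; the first ID handed out is 1.
    private long counter = 0L;

    /** Returns the existing ID for the text, or assigns and returns a new one. */
    long idFor(String text) {
        Long existing = ids.get(text);
        if (existing != null) {
            return existing;      // key exists: do not create a new value
        }
        counter++;                // key is new: increment the counter
        ids.put(text, counter);   // and record the mapping
        return counter;
    }

    /** Number of distinct text values seen so far. */
    int size() {
        return ids.size();
    }
}
```

Every distinct string stays in the map for the life of the program, so
memory use is proportional to the number of unique text values.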

However, for large data sets this approach is limited, since the map size
grows with the number of unique text values.

I can think of a couple of alternatives, each with a drawback:

1. Create a database table with a uniqueness constraint on the text value
(a primary key). Query the table before inserting a new long value. I am
guessing this may be slow.
2. Use a hashing algorithm. Whichever algorithm I choose, there is a
possibility of collision, so there is no guarantee of a unique long value
for a given unique text value.
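
To make the hashing option concrete, here is a rough sketch that takes the
first 8 bytes of a SHA-256 digest as the long value, and estimates the
chance of a collision with the standard birthday approximation
1 - exp(-n(n-1) / (2 * 2^64)). The class name TextHashId and both method
names are hypothetical, and SHA-256 is just one possible choice of hash.
The sketch does not remove collisions; it only shows how likely they are
for n distinct values.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: derives a 64-bit ID from a text value by hashing. */
final class TextHashId {
    private TextHashId() {}

    /**
     * First 8 bytes of SHA-256 over the UTF-8 bytes, read as a big-endian long.
     * The result may be negative. Equal strings always give the same ID, but
     * two different strings can, rarely, give the same ID as well.
     */
    static long hashId(String text) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(text.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /**
     * Birthday-bound estimate of the probability that at least two of n
     * distinct values share a 64-bit ID, assuming the hash is uniform.
     */
    static double collisionProbability(double n) {
        double pairsOverSpace = n * (n - 1) / (2.0 * Math.pow(2.0, 64));
        return -Math.expm1(-pairsOverSpace); // 1 - e^(-x), accurate for small x
    }
}
```

Unlike the map, this needs no shared state, so it can run independently on
each record. The trade-off is that mapping an ID back to its text still
requires storing the pairs somewhere.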

Are there any better ways to solve this for a large data set?

Thanks,
Ramu