Hello,

I'd like to calculate product similarity scores from the INPUT FILE below.
The INPUT FILE is around 300 million lines, and I want to run
ItemSimilarityJob on it every day.
(The exact command line is given below.)

The INPUT FILE keeps growing every day.
I'd rather not recompute the whole INPUT FILE each time, because processing
all of the data takes a long time.

Could you please advise me on how to process only the incremental data in
the INPUT FILE, without spending so much time?
Is it possible to process incremental data with ItemSimilarityJob?

Thanks


[Command]
-----------------------------------------------------------------
hadoop jar /usr/lib/mahout/mahout-core-0.7-cdh4.2.1-job.jar
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
-i INPUT_FILE_PATH
-o OUTPUT_FILE_PATH
--tempDir TEMP_DIR
-b TRUE
-m 100000
-s SIMILARITY_TANIMOTO_COEFFICIENT
-----------------------------------------------------------------
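For reference, one thing I considered (I'm not sure it is the right
approach) is extracting only the newly appended lines each day before
running any job. This is just a sketch; the file and variable names are
illustrative, and it assumes the INPUT FILE is strictly append-only:

```shell
#!/bin/sh
# Hypothetical sketch: keep the line count from the previous run, and
# emit only the lines appended since then. Assumes the input file is
# strictly append-only (existing lines are never modified or deleted).

# Demo data standing in for the real 300M-line INPUT FILE.
printf '1,10000\n1,10010\n2,20000\n' > demo_input.csv

INPUT=demo_input.csv          # full preference file (UserID,ProductID)
STATE=prev_line_count.txt     # line count recorded by the previous run
DELTA=daily_delta.csv         # receives only the newly appended lines

rm -f "$STATE"                # fresh start for this demo

# How many lines did the previous run already see? (0 on the first run)
PREV=$(cat "$STATE" 2>/dev/null || echo 0)
# Emit everything after the last line processed previously.
tail -n +"$((PREV + 1))" "$INPUT" > "$DELTA"
# Remember how far we got, for the next day's run.
wc -l < "$INPUT" > "$STATE"
```

But even if extracting the delta like this is easy, I don't see how
ItemSimilarityJob could use only the delta, since the similarity scores
depend on co-occurrence counts over the whole history.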

[INPUT FILE]

UserID,ProductID
--------------------
1,10000
1,10010
1,10020
2,20000
2,20020
3,20000
3,10010
4,20000
4,11000
4,22000
....
--------------------

[OUTPUT FILE]

ProductID   ProductID   Similarity score
-------------------------------------
10000   10010   0.003048780487804878
10000   10020   0.0035335689045936395
20000   20020   0.0027624309392265192
20000   22000   0.018518518518518517
....
-------------------------------------
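For context, my understanding is that SIMILARITY_TANIMOTO_COEFFICIENT for
two products is (number of shared users) / (number of distinct users of
either product). A small sanity check I use on the CSV directly (the item
pair and file name here are illustrative, not from the real data):

```shell
#!/bin/sh
# Hypothetical check: recompute the Tanimoto coefficient for one item
# pair straight from the UserID,ProductID file, i.e.
#   shared users / (users of A + users of B - shared users)

# Demo data in the same format as the INPUT FILE.
printf '1,10000\n1,10010\n1,10020\n2,20000\n3,10010\n' > demo_input.csv

score=$(awk -F, -v a=10010 -v b=10020 '
  # Count distinct users per item, guarding against duplicate lines.
  $2 == a && !($1 in seen_a) { seen_a[$1] = 1; na++ }
  $2 == b && !($1 in seen_b) { seen_b[$1] = 1; nb++ }
  END {
    inter = 0
    for (u in seen_a) if (u in seen_b) inter++
    union = na + nb - inter
    if (union > 0) print inter / union
  }' demo_input.csv)

# Item 10010 has users {1,3}, item 10020 has users {1}: 1 shared of
# 2 distinct users -> 0.5
echo "tanimoto(10010, 10020) = $score"
```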


Regards
Tomomichi Takiguchi
