Hello, I'd like to calculate product-similarity scores from the following INPUT FILE. The INPUT FILE has around 300 million lines, and I want to run ItemSimilarityJob on it every day. (The full command line is below.)
The number of lines in the INPUT FILE grows every day, and I don't want to recompute over the whole file each time because that takes a long time. Could you please advise me on how to process only the incremental data in the INPUT FILE without spending so much time? Is it possible to compute over incremental data with ItemSimilarityJob?

Thanks

[Command]
-----------------------------------------------------------------
hadoop jar /usr/lib/mahout/mahout-core-0.7-cdh4.2.1-job.jar org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob -i INPUT_FILE_PATH -o OUTPUT_FILE_PATH --tempDir TEMP_DIR -b TRUE -m 100000 -s SIMILARITY_TANIMOTO_COEFFICIENT
-----------------------------------------------------------------

[INPUT FILE]
UserID,ProductID
--------------------
1,10000
1,10010
1,10020
2,20000
2,20020
3,20000
3,10010
4,20000
4,11000
4,22000
....
--------------------

[OUTPUT FILE]
ProductID,ProductID,Similarity score
-------------------------------------
10000 10010 0.003048780487804878
10000 10020 0.0035335689045936395
20000 20020 0.0027624309392265192
20000 22000 0.018518518518518517
....
-------------------------------------

Regards,
Tomomichi Takiguchi
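To make the question concrete, here is a minimal sketch of what I mean by "incremental data": extracting only the user,product lines added since the previous day's run, using standard Unix tools. All file names here are hypothetical, and the sample data is just for illustration. (I realize the Tanimoto similarity of a pair still depends on the full history of co-occurrences, so feeding only the delta to ItemSimilarityJob may not give correct scores by itself; that is exactly what I am asking about.)

```shell
#!/bin/sh
# Sketch: isolate the daily delta of the INPUT FILE.
# File names (input_yesterday.csv, input_today.csv, input_delta.csv)
# are placeholders, not part of my actual pipeline.
set -e

# Stand-ins for yesterday's and today's INPUT FILE snapshots.
cat > input_yesterday.csv <<'EOF'
1,10000
1,10010
2,20000
EOF
cat > input_today.csv <<'EOF'
1,10000
1,10010
2,20000
2,20020
3,20000
EOF

# comm(1) requires sorted input; -13 keeps lines that appear
# only in today's file, i.e. the rows added since yesterday.
sort input_yesterday.csv > yesterday.sorted
sort input_today.csv     > today.sorted
comm -13 yesterday.sorted today.sorted > input_delta.csv

cat input_delta.csv
```

In this example, input_delta.csv would contain only the two new lines (2,20020 and 3,20000), which is a much smaller input than the full 300-million-line file.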