I think we need to take a closer look at your input to RowSimilarityJob.
Can you dump it? Could you also give us the parameters you're calling
the Mahout jobs with?
--sebastian
On 05.05.2011 13:17, Paul, Seby wrote:
Hi,
I am trying to find similar documents using the Mahout rowsimilarity job. I
have 7 small documents in my test set. Documents 2 and 3 have no words in
common, but the following output shows them with a similarity of 1.0, as
if they were identical.
0 elts: {0:0.9999999999999999, 1:1.0, 4:1.0, 5:1.0, 6:1.0}
1 elts: {0:1.0, 1:0.9999999999999999, 4:1.0, 5:1.0, 6:1.0}
2 elts: {2:1.0, 3:1.0}
3 elts: {2:1.0, 3:1.0}
4 elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:1.0}
5 elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:1.0}
6 elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:0.9999999999999999}
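For reference, the expectation that two documents with no shared words should not score 1.0 can be illustrated with a minimal sketch. It assumes cosine similarity over raw term counts, which may not be the measure the job was actually run with, and it uses made-up example documents rather than the 7 test documents.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two whitespace-tokenized documents (raw term counts)."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents, not taken from the test set.
disjoint = cosine_similarity("apple banana cherry", "dog elephant fox")
overlap = cosine_similarity("apple banana cherry", "apple banana grape")
print(disjoint, overlap)
```

Under this assumption, documents without common terms have a zero dot product, so a reported value of 1.0 for such a pair suggests the input vectors or row ids are not what was expected.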
I executed the following commands to generate the above output.
Step 1: bin/mahout seqdirectory - converted the documents to SequenceFile format
Step 2: bin/mahout seq2sparse - converted to vector format
Step 3: bin/mahout rowid - converted into matrix format
Step 4: bin/mahout rowsimilarity - computed row similarity
Step 5: bin/mahout vectordump - converted output to readable format
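The readable output from Step 5 can also be checked mechanically. The following is a rough sketch, assuming the dump lines have exactly the `row elts: {col:value, ...}` form of the output listed above; the names `parse_dump` and `near_identical_pairs` are illustrative helpers, not part of Mahout.

```python
import re

# The vectordump output quoted in this message.
DUMP = """\
0 elts: {0:0.9999999999999999, 1:1.0, 4:1.0, 5:1.0, 6:1.0}
1 elts: {0:1.0, 1:0.9999999999999999, 4:1.0, 5:1.0, 6:1.0}
2 elts: {2:1.0, 3:1.0}
3 elts: {2:1.0, 3:1.0}
4 elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:1.0}
5 elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:1.0}
6 elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:0.9999999999999999}
"""

LINE_RE = re.compile(r"^(\d+) elts: \{(.*)\}$")

def parse_dump(text):
    """Parse dump lines into {row: {column: similarity}}."""
    rows = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a dump line
        entries = {}
        for pair in match.group(2).split(","):
            if pair.strip():
                col, val = pair.split(":")
                entries[int(col)] = float(val)
        rows[int(match.group(1))] = entries
    return rows

def near_identical_pairs(rows, threshold=0.999):
    """Return sorted (row, other_row) pairs, row < other_row, scoring >= threshold."""
    pairs = set()
    for row, entries in rows.items():
        for col, value in entries.items():
            if row != col and value >= threshold:
                pairs.add((min(row, col), max(row, col)))
    return sorted(pairs)

rows = parse_dump(DUMP)
pairs = near_identical_pairs(rows)
print(pairs)
```

On this output the rows fall into two groups, {0, 1, 4, 5, 6} and {2, 3}, with every pair inside a group scoring about 1.0. Comparing these row numbers against the original documents would show whether the row ids still refer to the documents they are assumed to.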
Please help me fix this issue.
Thank you in advance for your help.
Seby Paul