I think we need to take a closer look at your input to RowSimilarityJob. Can you dump it? Could you also give us the parameters you're calling the Mahout jobs with?
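
For reference, here is a minimal sketch of what we would expect, assuming the similarity measure in use is cosine (the original message does not say which one was chosen): two documents with no terms in common should score 0, not 1. The `cosine_similarity` helper and the vectors below are made up for illustration; this is not Mahout code and not the actual data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {term_index: weight} dicts."""
    # Only indices present in both vectors contribute to the dot product.
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical term vectors (assumed values, not taken from the real input).
doc_a = {0: 1.5, 1: 2.0}
doc_b = {2: 0.7, 3: 1.1}  # shares no term index with doc_a
doc_c = {0: 3.0, 1: 4.0}  # same direction as doc_a, different length

print(cosine_similarity(doc_a, doc_b))  # disjoint terms give 0.0
print(cosine_similarity(doc_a, doc_c))
```

If the dumped input vectors for two documents really have no indices in common and the job still reports 1.0, the problem is more likely in how the matrix was built or in the job parameters than in the documents themselves.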

--sebastian

On 05.05.2011 13:17, Paul, Seby wrote:
Hi,



I am trying to find similar documents using the Mahout RowSimilarityJob. I
have 7 small documents in my test set. There are no common words between
documents 2 and 3, but the output below shows them as exactly similar.





0       elts: {0:0.9999999999999999, 1:1.0, 4:1.0, 5:1.0, 6:1.0}

1       elts: {0:1.0, 1:0.9999999999999999, 4:1.0, 5:1.0, 6:1.0}

2       elts: {2:1.0, 3:1.0}

3       elts: {2:1.0, 3:1.0}

4       elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:1.0}

5       elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:1.0}

6       elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:0.9999999999999999}



I executed the following commands to generate the above output.



Step 1: bin/mahout seqdirectory - converted to SequenceFile format

Step 2: mahout seq2sparse - converted to vector format

Step 3: bin/mahout rowid - converted into matrix format

Step 4: bin/mahout rowsimilarity - computed row similarity

Step 5: bin/mahout vectordump - converted output to readable format



Please help me fix this issue.



Thank you for your help in advance.



Seby Paul




