Sebastian, I found the issue: the vectors were not built from all the words in the documents because of the default minimum support (2). There were 9 documents in my test set, and two of them were not converted to vectors because of the minimum support constraint. I see correct results after setting the minimum support threshold to 1.
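
To illustrate the effect, here is a minimal sketch in plain Python, not Mahout code. It assumes that minimum support means a term must occur at least that many times across the whole corpus, and that a document left with no terms produces no vector. The function name `vectorize` and the sample documents are made up for the example.

```python
from collections import Counter

def vectorize(docs, min_support):
    """Build term-count vectors, keeping only terms whose corpus-wide
    count is at least min_support. Documents with no surviving terms
    are omitted (assumption: they yield no vector)."""
    totals = Counter(tok for doc in docs for tok in doc.split())
    vectors = {}
    for i, doc in enumerate(docs):
        kept = Counter(t for t in doc.split() if totals[t] >= min_support)
        if kept:
            vectors[i] = kept
    return vectors

docs = [
    "apple banana apple",
    "banana cherry",
    "kiwi mango",      # every word here occurs only once in the corpus
    "cherry apple",
]

print(sorted(vectorize(docs, min_support=2)))  # [0, 1, 3]
print(sorted(vectorize(docs, min_support=1)))  # [0, 1, 2, 3]
```

With a threshold of 2 the third document disappears, so the number of vectors no longer matches the number of input documents; with a threshold of 1 every document is kept.
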
Thank you
Seby Paul

-----Original Message-----
From: Sebastian Schelter [mailto:[email protected]]
Sent: Thursday, May 05, 2011 7:00 AM
To: [email protected]
Subject: Re: similar documents using mahout rowsimilarity job

I think we need to take a closer look at your input to RowSimilarityJob, can you dump it? Could you also give us the parameters you're calling the Mahout jobs with?

--sebastian

On 05.05.2011 13:17, Paul, Seby wrote:
> Hi,
>
> I am trying to find similar documents using mahout rowsimilarity job, I
> have 7 small documents in test set. There are no common words between
> document 2 and 3, but the output shows that they are exactly similar
> based on the following output.
>
> 0 elts: {0:0.9999999999999999, 1:1.0, 4:1.0, 5:1.0, 6:1.0}
> 1 elts: {0:1.0, 1:0.9999999999999999, 4:1.0, 5:1.0, 6:1.0}
> 2 elts: {2:1.0, 3:1.0}
> 3 elts: {2:1.0, 3:1.0}
> 4 elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:1.0}
> 5 elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:1.0}
> 6 elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:0.9999999999999999}
>
> I executed the following commands to generate the above output.
>
> Step 1: bin/mahout seqdirectory - converted to sequential file format
> Step 2: mahout seq2sparse - converted to vector format
> Step 3: bin/mahout rowid - converted into matrix format
> Step 4: bin/mahout rowsimilarity - computed row similarity
> Step 5: bin/mahout vectordump - converted output to readable format
>
> Please help me how to fix this issue.
>
> Thank you for your help in advance.
>
> Seby Paul
